Preface



Industrial robots carry out simple tasks in customized environments for which 
it is typical that nearly all effector movements can be planned during an off- 
line phase. A continual control based on sensory feedback is at most necessary 
at effector positions near target locations utilizing torque or haptic sensors. 
It is desirable to develop new-generation robots showing higher degrees of 
autonomy for solving high-level deliberate tasks in natural and dynamic envi- 
ronments. Obviously, camera-equipped robot systems, which take and process 
images and make use of the visual data, can solve more sophisticated robotic 
tasks. The development of a (semi-) autonomous camera-equipped robot must 
be grounded on an infrastructure, based on which the system can acquire 
and/or adapt task-relevant competences autonomously. This infrastructure 
consists of technical equipment to support the presentation of real world 
training samples, various learning mechanisms for automatically acquiring 
function approximations, and testing methods for evaluating the quality of 
the learned functions. Accordingly, to develop autonomous camera-equipped 
robot systems one must first demonstrate relevant objects, critical situations, 
and purposive situation-action pairs in an experimental phase prior to the 
application phase. Secondly, the learning mechanisms are responsible for ac- 
quiring image operators and mechanisms of visual feedback control based on 
supervised experiences in the task-relevant, real environment. 

This paradigm of learning-based development leads to the concepts of 
compatibilities and manifolds. Compatibilities are general constraints on the 
process of image formation which hold more or less under task-relevant 
or accidental variations of the imaging conditions. Based on learned de- 
grees of compatibilities, one can choose those image operators together with 
parametrizations, which are expected to be most adequate for treating the 
underlying task. On the other hand, significant variations of image features 
are represented as manifolds. They may originate from changes in the spatial 
relation among robot effectors, cameras, and environmental objects. Learned 
manifolds are the basis for acquiring image operators for task-relevant object 
or situation recognition. The image operators are constituents of task-specific, 
behavioral modules which integrate deliberate strategies and visual feedback 
control. The guideline for system development is that the resulting behaviors
should simultaneously meet requirements such as task-relevance, robustness,
flexibility, time limitation, etc. All principles to be presented in the work
are based on real scenes of man-made objects and a multi-component robot 
system consisting of robot arm, head, and vehicle. A high-level application is 
presented that includes sub-tasks such as localizing, approaching, grasping, 
and carrying objects. 



Acknowledgements 

Since 1993 I have been a member of the Kognitive Systeme Gruppe in Kiel. 
I am most grateful to G. Sommer, the head of this group, for the contin- 
ual advice and support. I have learned so much from his wide spectrum of 
scientific experience. 

An earlier version of this book was submitted and accepted as a habilitation
thesis at the Institut für Informatik und Praktische Mathematik,
Technische Fakultät of the Christian-Albrechts-Universität, in Kiel. I’m very
grateful to the six persons who were responsible for assessing the
work. These are: V. Hlavac from the Technical University of Prague in the
Czech Republic, R. Klette from the University of Auckland in New Zealand,
C.-E. Liedtke from the Universität Hannover, G. Sommer and A. Srivastav
from the Christian-Albrechts-Universität Kiel, and F. Wahl from the
Technische Universität Braunschweig. Deepest thanks also for the great interest
in my work.

I appreciate the discussions with former and present colleagues J. Bruske,
T. Bülow, K. Daniilidis, M. Felsberg, M. Hansen, N. Krüger, V. Krüger,
U. Mahlmeister, C. Perwass, B. Rosenhahn, W. Yu, and my students
M. Benkwitz, A. Bunten, S. Kunze, F. Lempelius, M. Paschke, A. Schmidt,
W. Timm, and J. Troster.

Technical support was provided by A. Bunten, G. Diesner, and H. Schmidt. 
My thanks to private individuals have been expressed personally. 



March 2001 



Josef Pauli 




Contents 



1. Introduction
1.1 Need for New-Generation Robot Systems
1.2 Paradigms of Computer Vision (CV) and Robot Vision (RV)
1.2.1 Characterization of Computer Vision
1.2.2 Characterization of Robot Vision
1.3 Robot Systems versus Autonomous Robot Systems
1.3.1 Characterization of a Robot System
1.3.2 Characterization of an Autonomous Robot System
1.3.3 Autonomous Camera-Equipped Robot System
1.4 Important Role of Demonstration and Learning
1.4.1 Learning Feature Compatibilities under Real Imaging
1.4.2 Learning Feature Manifolds of Real World Situations
1.4.3 Learning Environment-Effector-Image Relationships
1.4.4 Compatibilities, Manifolds, and Relationships
1.5 Chapter Overview of the Work

2. Compatibilities for Object Boundary Detection
2.1 Introduction to the Chapter
2.1.1 General Context of the Chapter
2.1.2 Object Localization and Boundary Extraction
2.1.3 Detailed Review of Relevant Literature
2.1.4 Outline of the Sections in the Chapter
2.2 Geometric/Photometric Compatibility Principles
2.2.1 Hough Transformation for Line Extraction
2.2.2 Orientation Compatibility between Lines and Edges
2.2.3 Junction Compatibility between Pencils and Corners
2.3 Compatibility-Based Structural Level Grouping
2.3.1 Hough Peaks for Approximate Parallel Lines
2.3.2 Phase Compatibility between Parallels and Ramps
2.3.3 Extraction of Regular Quadrangles
2.3.4 Extraction of Regular Polygons
2.4 Compatibility-Based Assembly Level Grouping
2.4.1 Focusing Image Processing on Polygonal Windows
2.4.2 Vanishing-Point Compatibility of Parallel Lines
2.4.3 Pencil Compatibility of Meeting Boundary Lines
2.4.4 Boundary Extraction for Approximate Polyhedra
2.4.5 Geometric Reasoning for Boundary Extraction
2.5 Visual Demonstrations for Learning Degrees of Compatibility
2.5.1 Learning Degree of Line/Edge Orientation Compatibility
2.5.2 Learning Degree of Parallel/Ramp Phase Compatibility
2.5.3 Learning Degree of Parallelism Compatibility
2.6 Summary and Discussion of the Chapter

3. Manifolds for Object and Situation Recognition
3.1 Introduction to the Chapter
3.1.1 General Context of the Chapter
3.1.2 Approach for Object and Situation Recognition
3.1.3 Detailed Review of Relevant Literature
3.1.4 Outline of the Sections in the Chapter
3.2 Learning Pattern Manifolds with GBFs and PCA
3.2.1 Compatibility and Discriminability for Recognition
3.2.2 Regularization Principles and GBF Networks
3.2.3 Canonical Frames with Principal Component Analysis
3.3 GBF Networks for Approximation of Recognition Functions
3.3.1 Approach of GBF Network Learning for Recognition
3.3.2 Object Recognition under Arbitrary View Angle
3.3.3 Object Recognition for Arbitrary View Distance
3.3.4 Scoring of Grasping Situations
3.4 Sophisticated Manifold Approximation for Robust Recognition
3.4.1 Making Manifold Approximation Tractable
3.4.2 Log-Polar Transformation for Manifold Simplification
3.4.3 Space-Time Correlations for Manifold Refinement
3.4.4 Learning Strategy with PCA/GBF Mixtures
3.5 Summary and Discussion of the Chapter

4. Learning-Based Achievement of RV Competences
4.1 Introduction to the Chapter
4.1.1 General Context of the Chapter
4.1.2 Learning Behavior-Based Systems
4.1.3 Detailed Review of Relevant Literature
4.1.4 Outline of the Sections in the Chapter
4.2 Integrating Deliberate Strategies and Visual Feedback
4.2.1 Dynamical Systems and Control Mechanisms
4.2.2 Generic Modules for System Development
4.3 Treatment of an Exemplary High-Level Task
4.3.1 Description of an Exemplary High-Level Task
4.3.2 Localization of a Target Object in the Image
4.3.3 Determining and Reconstructing Obstacle Objects
4.3.4 Approaching and Grasping Obstacle Objects
4.3.5 Clearing Away Obstacle Objects on a Parking Area
4.3.6 Inspection and/or Manipulation of a Target Object
4.3.7 Monitoring the Task-Solving Process
4.3.8 Overall Task-Specific Configuration of Modules
4.4 Basic Mechanisms for Camera-Robot Coordination
4.4.1 Camera-Manipulator Relation for One-Step Control
4.4.2 Camera-Manipulator Relation for Multi-step Control
4.4.3 Hand Servoing for Determining the Optical Axis
4.4.4 Determining the Field of Sharp View
4.5 Summary and Discussion of the Chapter

5. Summary and Discussion
5.1 Developing Camera-Equipped Robot Systems
5.2 Rationale for the Contents of This Work
5.3 Proposals for Future Research Topics

Appendix 1: Ellipsoidal Interpolation
Appendix 2: Further Behavioral Modules
Symbols
Index
References




1. Introduction 



The first chapter presents an extensive introduction to the book, starting
with the motivation. Next, the Robot Vision paradigm is characterized
and contrasted with the field of Computer Vision. Robot Vision is the
indisputable kernel of Autonomous Camera-Equipped Robot Systems. For the
development of such new-generation robot systems the important role of visual
demonstration and learning is explained. The final section gives an overview
of the chapters of the book.



1.1 Need for New-Generation Robot Systems

We briefly describe the present state and problems of robotics, give an outlook
on trends in research and development, and summarize the specific novelty
contributed by this book.

Present State of Robotics 

Industrial robots carry out recurring simple tasks in a fast, accurate and 
reliable manner. This is typically the case in applications of series production. 
The environment is customized in relation to a fixed location and volume 
occupied by the robot and/or the robot is built such that certain spatial 
relations with a fixed environment are kept. Task-relevant effector trajectories 
must be planned perfectly during an offline phase and unexpected events 
must not occur during the subsequent online phase. Close-range sensors are 
utilized (if at all) for a careful control of the effectors at the target positions. 
Generally, sophisticated perception techniques and learning mechanisms, e.g. 
involving Computer Vision and Neural Networks, are unnecessary due to 
customized relations between robot and environment. 

In the nineteen eighties and nineties impressive progress was achieved
in supporting the development and programming of industrial robots.

• CAD (Computer Aided Design) tools are used for convenient and rapid
designing of the hardware of robot components, for example, shape and
size of manipulator links, degrees-of-freedom of manipulator joints, etc.

• Application-specific signal processors are responsible for the control of the
motors of the joints and thus cope with the dynamics of articulated robots.
By solving the inverse kinematics the effectors can be positioned to
sub-millimeter accuracy, and the accuracy does not degrade even at high
frequencies of repetition.

• High-level robot programming languages are available to develop spe- 
cific programs for executing certain effector trajectories. There are several 
methodologies to automate the work of programming. 

• Teach-in techniques rely on a demonstration of an effector trajectory, which 
is executed using a control panel, and the course of effector coordinates is 
memorized and transformed into a sequence of program-steps. 

• Automatic planning systems are available which generate robot programs 
for assembly or disassembly tasks, e.g. sequences of movement steps of the 
effector for assembling complex objects out of components. These systems 
assume that initial state and desired state of the task are known accurately. 

• Appropriate control mechanisms are applicable to fine-control the effectors
at the target locations. This control is based on sensory feedback from
close-range sensors, e.g. torque or haptic sensors.
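
To make the inverse-kinematics step mentioned in the list above more concrete,
the following minimal sketch treats a planar arm with two links. The arm
geometry, the function names, and the link lengths are assumptions chosen for
illustration; they do not describe any particular industrial controller.

```python
import math


def inverse_kinematics_2link(x, y, l1, l2):
    """Joint angles (theta1, theta2) in radians that place the tip of a planar
    two-link arm with link lengths l1, l2 at the point (x, y).

    Returns the solution with a non-negative elbow angle theta2.
    Raises ValueError if the target lies outside the reachable workspace.
    """
    d2 = x * x + y * y
    # Law of cosines applied to the triangle shoulder-elbow-tip.
    cos_t2 = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_t2) > 1.0 + 1e-9:
        raise ValueError("target out of reach")
    cos_t2 = max(-1.0, min(1.0, cos_t2))  # guard against rounding noise
    theta2 = math.acos(cos_t2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2


def forward_kinematics_2link(theta1, theta2, l1, l2):
    """Tip position of the same arm, used here only to check the solution."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y
```

Feeding the computed angles back into the forward kinematics reproduces the
target position up to rounding error, which is the kind of offline check that
is possible because the arm geometry is known exactly.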

This development kit consisting of tools, techniques and mechanisms is
widely available for industrial robots. Despite that, there are serious
limitations concerning the possible application areas of industrial robots. In
the following, problems and requirements in robotics are summarized, which
serve as a motivation for the need for advanced robot systems. The mentioned
development kit will be a part of a more extensive infrastructure which is
necessary for the creation and application of new-generation robots.

Problems and Requirements in Robotics 

The lack of a camera subsystem and of a concept for making extensive use
of environmental data is a source of many limitations in industrial robots.

• In the long term, gradual and unnoticed wear of robot components will
influence the manufacturing process in an unfavourable manner, which may
lead to unusable products. In exceptional cases the robot effector may even
damage certain environmental components of the manufacturing plant.

• Exceptional, non-deterministic incidents with the robot or in the
environment, e.g. breakage of the effector or a dislocated object arrangement,
need to be recognized automatically in order to stop the robot and/or adapt
the planned actions.

• In series production the geometric attributes must vary only within tight
limits from object to object in the succession, e.g. nearly constant size,
position, and orientation. Applications with frequently changing situations,
e.g. due to object variety, cannot be treated by industrial robots.

• The mentioned limitations will cause additional costs which contribute to
the overall manufacturing expenses. These costs can be traced back to
the production of unusable products, the loss of production due to offline
adaptation, the damage of robot equipment, etc.

The main methodology to overcome these problems is to perceive the
environment continually and make use of the reconstructed spatial relations
between robot effector and target objects. In addition to the close-range
sensors one essentially needs long-range perception devices such as video,
laser, infrared, and ultrasonic cameras. The long-range characteristic of
cameras is appreciated for measuring effector-object relations early, in order
to adapt the effector movement in time (if needed). The specific limitations
and constraints, which are inherent in the different perception devices, can be
compensated by a fusion of the different image modalities. Furthermore, it is
advantageous to utilize steerable cameras which provide the opportunity to
control external and internal degrees-of-freedom such as pan, tilt, vergence,
aperture, focus, and zoom. Image analysis is the basic means for the primary
goal of reconstructing the effector-object relations, but also the prerequisite
for the secondary goal of information fusion and camera control. To be really
useful, the image analysis system must extract purposive information in the
available slice of time.

The application of camera-equipped robots (in contrast to blind industrial
robots) could lead to damage prevention, flexibility increase, cost reduction,
etc. However, the extraction of relevant image information and the
construction of adequate image-motor mappings for robot control cause
tremendous difficulties. Generally, it is hard if not impossible to prove the
correctness of reconstructed scene information and the goal-orientedness of
image-motor mappings. This is the reason why the development and application
of camera-equipped robots are restricted to (practically oriented) research
institutes. So far, industries still avoid their application, apart from some
exceptional cases. The components and dynamics of more or less natural
environments are too complex, and therefore imponderables will occur which
cannot be considered in advance. More concretely, quite often the procedures
to be programmed for image analysis and robot control are inadequate,
unstable, inflexible, and inefficient. Consequently, the development and
application of new-generation robots must be grounded on a learning paradigm.
For supporting the development of autonomous camera-equipped robot systems, the
nascent robot system must be embedded in an infrastructure, based on which
the system can learn task-relevant image operators and image-motor mappings.
In addition to that, the robot system must be prepared to gather experience
throughout its lifetime and to adapt its behaviors to new environments.

New Application Areas for Camera-Equipped Robot Systems 

In our opinion, leading-edge robotics institutes agree with the presented
catalog of problems and requirements. Pursuing the mission to strive for
autonomous robots, each institute individually focuses on and treats some of
the problems in detail. As a result, new-generation robot systems are
being developed, which can be regarded as exciting prototype solutions.
Frequently, the robots show increased robustness and flexibility for tasks which
need to be solved in non-customized environments. The robots' behaviors are
purposive despite large variations of environmental situations or even in
cases of exceptional, non-deterministic incidents. Consequently, by applying
new-generation robots to classical tasks (up to now performed by industrial
robots), it should be possible to relax the customizing of the environment.
For example, the manufacturing process can be organized more flexibly with
the purpose of increasing the product variety.¹

Beyond manufacturing plants, which are typical environments of industrial
robots, the camera-equipped robots should be able to work purposively
in completely different (more natural) environments and carry out new
categories of tasks. Examples of such tasks include supporting disabled
persons at home, cleaning rooms in office buildings, doing work in hazardous
environments, automatic modeling of real objects or scenes, etc. These tasks
have in common that objects or scenes must be detected in the images and
reconstructed with greater or lesser degree of detail. For this purpose the
agility of a camera-equipped robot is exploited in order to take environmental
images under controlled camera motion. The advantages are manifold, for
example, taking a degenerate view to simplify specific inspection tasks, taking
various images under several poses to support and verify object recognition,
or taking an image sequence under continual view variation for complete object
reconstruction.

The previous discussion presented an idea of the wide spectrum of potential
application areas for camera-equipped robot systems. Unfortunately,
despite encouraging successes achieved by robotics institutes, there are
still tremendous difficulties in creating really usable camera-equipped robot
systems. In practical applications these robot systems are lacking correct and
goal-oriented image-motor mappings. This finding can be traced back to the
lack of correctness of image processing, feature extraction, and reconstructed
scene information. New concepts are needed for the development and
evaluation of image analysis methods and image-motor mappings.

Contribution and Novelty of This Book 

This work introduces a practical methodology for developing autonomous
camera-equipped robot systems which are intended to solve high-level,
deliberate tasks. The development is grounded on an infrastructure, based on
which the system can learn competences by interaction with the real
task-relevant world. The infrastructure consists of technical equipment to
support the demonstration of real world training samples, various learning
mechanisms for automatically acquiring function approximations, and testing
methods for evaluating the quality of the learned functions. Accordingly, the
application phase must be preceded by an experimental phase in order to
construct image operators and servoing procedures, on which the task-solving
process mainly relies. Visual demonstration and neural learning are the
backbone for acquiring the situated competences in the real environment.

¹ The German engineering newspaper VDI-Nachrichten reported in the issue of
February 18, 2000, that BMW intends to make investments of 30 billion Deutsche
Marks for the development of highly flexible manufacturing plants. A major aim
is to develop and apply more flexible robots which should be able to
simultaneously build different versions of BMW cars in each manufacturing
plant. The spectrum of car variety must not be limited by inflexible
manufacturing plants, but should only depend on specific demands on the market.

This paradigm of learning-based development distinguishes between two 
learnable categories: compatibilities and manifolds. Compatibilities are gen- 
eral constraints on the process of image formation, which do hold to a certain 
degree. Based on learned degrees of compatibilities, one can choose those im- 
age operators together with parametrizations, which are expected to be most 
adequate for treating the underlying task. On the other hand, significant 
variations of image features are represented as manifolds. They may orig- 
inate from changes in the spatial relation among robot effectors, cameras, 
and environmental objects. Learned manifolds are the basis for acquiring im- 
age operators for task-relevant object or situation recognition. The image 
operators are constituents of task-specific, behavioral modules which
integrate deliberate strategies and visual feedback control. In summary, useful
functions for image processing and robot control can be developed on the
basis of learned compatibilities and manifolds.

The practicality of this development methodology has been verified in 
several applications. In the book, we present a structured application that 
includes high-level sub-tasks such as localizing, approaching, grasping, and 
carrying objects. 



1.2 Paradigms of Computer Vision (CV) and Robot 
Vision (RV) 

The section cites well-known definitions of Computer Vision and characterizes 
the new methodology of Robot Vision. 

1.2.1 Characterization of Computer Vision 

Almost 20 years ago, Ballard and Brown introduced a definition for the term
Computer Vision which has been commonly accepted up to the present [11].

Definition 1.1 (Computer Vision, according to Ballard) Computer
Vision is the construction of explicit, meaningful descriptions of physical
objects from images. Image processing, which studies image-to-image
transformations, is the basis for explicit description building. The challenge
of Computer Vision is one of explicitness. Explicit descriptions are a
prerequisite for recognizing, manipulating, and thinking about objects.

In the nineteen eighties and early nineties the research on Artificial
Intelligence influenced the Computer Vision community [177]. According to
the principle of Artificial Intelligence, both common sense and
application-specific knowledge are represented explicitly, and reasoning
mechanisms are applied (e.g. based on predicate calculus) to obtain a problem
solver for a specific application area [119]. Accordingly, explicitness is
essential in both Artificial Intelligence and Computer Vision. This coherence
inspired Haralick and Shapiro to formulate a definition of Computer Vision
which uses typical terms of Artificial Intelligence [73].

Definition 1.2 (Computer Vision, according to Haralick) Computer 
Vision is the combination of image processing, pattern recognition, and ar- 
tificial intelligence technologies which focuses on the computer analysis of 
one or more images, taken with a singleband/multiband sensor, or taken in 
time sequence. The analysis recognizes and locates the position and orienta- 
tion, and provides a sufficiently detailed symbolic description or recognition 
of those imaged objects deemed to be of interest in the three-dimensional en- 
vironment. The Computer Vision process often uses geometric modeling and 
complex knowledge representations in an expectation- or model-based match- 
ing or searching methodology. The searching can include bottom-up, top-down, 
blackboard, hierarchical, and heterarchical control strategies. 



Main Issues of Computer Vision 

The latter definition proposes to use Artificial Intelligence technologies for 
solving problems of representation and reasoning. The interesting objects 
must be extracted from the image leading to a description of the 2D image 
situation. Based on that, the 3D world situation must be derived. At least 
four main issues are left open and have to be treated in any Computer Vision 
system. 



1. Which types of representation for 3D world situations are appropriate?

2. Where do the models for detection of 2D image situations originate?

3. Which reasoning or matching techniques are appropriate for detection
tasks?

4. How should the gap between 2D image and 3D world situations be bridged?



Non-realistic Desires in Computer Vision 

This paradigm of Computer Vision resembles the enthusiastic work in the
nineteen sixties on developing a General Problem Solver [118]. Nowadays, the
efforts for a General Problem Solver appear hopeless and ridiculous, and it is
similarly ridiculous to strive for a General Vision System, which is supposed
to solve any specific vision task [2]. Taking the four main issues of Computer
Vision into account, a general system would have to include the following
four characteristics.



1. A unifying representation framework for dealing with various rep- 
resentations of signals and symbols. 

2. Common modeling tools for acquiring models, e.g. for reconstruc- 
tion from images or for generation of CAD data. 

3. General reasoning techniques (e.g. in fuzzy logic) for extracting
relevant image structures, or general matching procedures for
recognizing image structures.

4. General imaging theories to model the mapping from 3D world 
into 2D images (executed by the cameras). 



Continuing with the train of thought, a General Vision System would 
have to be designed as a shell. This is quite similar to Expert System Shells 
which include general facilities of knowledge representation and reasoning. 
Various categories of knowledge, ranging from specific scene/task knowledge 
to general knowledge about the use of image processing libraries, are supposed 
to be acquired and filled into the shell on demand. Crevier and Lepage present
an extensive survey of knowledge-based image understanding systems [43];
however, they concede that "genuine general-purpose image processing shells
do not yet exist." In summary, representation frameworks, modeling tools,
reasoning and matching techniques, and imaging theories are not available in 
the required generality. 

Favouring Robot Vision in Opposition to Computer Vision 

The statement of this book is that the required generality can never be 
reached, and that degradations in generality are acceptable in practical sys- 
tems. However, current Computer Vision systems (in industrial use) only 
work well for specific scenes under specific imaging conditions. Furthermore, 
this specificity has also influenced the design process, and, consequently, there 
is no chance to adapt a classical system to different scenes. 



New design principles for more general and flexible systems are nec- 
essary in order to overcome to a certain extent the large gap between 
general desire and specific reality. 



These principles can be summarized briefly by animated attention, purposive
perception, visual demonstration, compatible perception, biased learning,
and feedback analysis. The following discussion will reveal that all principles
are closely connected with each other. The succinct term Robot Vision is used
for systems which take these principles into account.²

1.2.2 Characterization of Robot Vision 

Animated Vision by Attention Control 

It is assumed that most of the three-dimensional vision-related applications 
must be treated by analyzing images at different viewing angles and/or dis- 
tances [12, 1]. Through exploratory controlled camera movement the system 
gathers information incrementally, i.e. the environment serves as external 
memory from which to read on demand. This paradigm of animated vision 
also includes mechanisms of selective attention and space-variant sensing [40].
Generally, a two-part strategy is involved consisting of attention control and 
detailed treatment of the most interesting places [145, 181]. This approach 
is a compromise for the trade-off between effort of computations and sensing 
at high resolution. 
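
The two-part strategy can be sketched in a few lines. The example below is
only an illustration under simplifying assumptions: the image is a nested list
of gray values, the coarse view is obtained by plain subsampling, and the
interest measure (deviation of a coarse pixel from the coarse mean) is a
hypothetical stand-in for a learned attention mechanism.

```python
def subsample(image, factor):
    """Coarse view: keep every factor-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def most_interesting_cell(coarse):
    """Attention stage: pick the coarse pixel that deviates most from the
    mean gray value of the coarse view (assumed interest measure)."""
    values = [v for row in coarse for v in row]
    mean = sum(values) / len(values)
    best_cell, best_score = (0, 0), -1.0
    for r, row in enumerate(coarse):
        for c, v in enumerate(row):
            score = abs(v - mean)
            if score > best_score:
                best_cell, best_score = (r, c), score
    return best_cell


def attend(image, factor):
    """Return the top-left corner and the full-resolution window that
    corresponds to the most interesting coarse pixel."""
    coarse = subsample(image, factor)
    r, c = most_interesting_cell(coarse)
    top, left = r * factor, c * factor
    window = [row[left:left + factor] for row in image[top:top + factor]]
    return (top, left), window
```

Only the selected window is handed on for detailed treatment, so the expensive
analysis touches a small part of the high-resolution data.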

Purposive Visual Information 

Only that information about the environmental world which is relevant for
the vision task must be extracted from the images. The modality of that
information can be of quantitative or qualitative nature [4]. In various phases
of a Robot Vision task presumably different modalities of information are
useful, e.g. color information for tracking robot fingers, and geometric
information for grasping objects. The minimalism principle emphasizes solving
the task by using features as basic as possible [87], i.e. avoiding
time-consuming, error-prone data abstraction and high-level image
representation.

Symbol Grounding by Visual Demonstration 

Models, which represent target situations, will only prove useful if they are 
acquired in the same way, or under the same circumstances, as when the system
perceives the scene in the real application [75]. It is important to have a close
relation between physically grounded task specifications and the appearance 
of actual situations [116]. Furthermore, it is easier for a person to specify tar- 
get situations by demonstrating examples instead of describing visual tasks 
symbolically. Therefore, visual demonstration overcomes the necessity of de- 
termining quantitative theories of image formation. 

Perception Compatibility (Geometry/Photometry) 

In the imaging process, certain compatibilities hold between the (global)
geometric shape of the object surface and the (local) gray value structure in
the photometric image [108]. However, there is no one-to-one correspondence
between surface discontinuities and extracted gray value edges, e.g. due to
texture, uniform surface color, or lighting conditions. Consequently,
qualitative compatibilities must be exploited, which are generally valid for
certain classes of regular objects and certain types of camera objectives, in
order to bridge the global-to-local gap of representation.

² The adequacy will become obvious later on.

Biased Learning of Signal Transformation 

The signal coming from the imaging process must be transformed into 2D 
or 3D features, whose meaning depends on the task at hand, e.g. serving 
as motor signal for robot control, or serving as symbolic description for a 
user. This transformation must be learned on the basis of samples, as there
is no theory for determining it a priori. Each signal is regarded as a point
in an extremely high-dimensional space, and the samples of the
transformation will cover only a very small fraction of this space [120].
Attention control, visual demonstration, and geometry/photometry
compatibilities serve as bias for determining the transformation, which is
thereby restricted to a relevant signal sub-space.

Feedback-Based Autonomous Image Analysis 

The analysis algorithms used for signal transformation require the setting or 
adjustment of parameters [101]. A feedback mechanism is needed to reach 
autonomy instead of adjusting the parameters interactively [180]. A cyclic 
process of quality assessment, parameter adjustment, and repeated applica- 
tion of the algorithm can serve as backbone of an automated system [126]. 
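
The cycle of application, quality assessment, and parameter adjustment can be sketched in a few lines of Python. The sketch is only an illustration under assumptions that are not taken from the text: the analysis algorithm is a simple gray value thresholding, the quality measure compares the segmented foreground fraction with an expected fraction, and the adjustment rule is a bisection on the threshold. All function names are hypothetical.

```python
def segment(gray_values, threshold):
    """Hypothetical analysis step: mark pixels brighter than the threshold."""
    return [g > threshold for g in gray_values]


def quality(mask, expected_fraction):
    """Hypothetical quality measure: signed error of the foreground fraction."""
    return sum(mask) / len(mask) - expected_fraction


def tune_threshold(gray_values, expected_fraction, tolerance=0.01, max_cycles=20):
    """Cycle of application, quality assessment, and parameter adjustment."""
    low, high = 0.0, 255.0
    threshold = (low + high) / 2.0
    for _ in range(max_cycles):
        mask = segment(gray_values, threshold)      # apply the algorithm
        error = quality(mask, expected_fraction)    # assess the result
        if abs(error) <= tolerance:
            break
        if error > 0:       # too much foreground: raise the threshold
            low = threshold
        else:               # too little foreground: lower the threshold
            high = threshold
        threshold = (low + high) / 2.0              # adjust the parameter
    return threshold


pixels = list(range(0, 200, 2))     # 100 synthetic gray values 0, 2, ..., 198
best = tune_threshold(pixels, expected_fraction=0.25)
```

In a real system the quality measure would have to be derived from the task, e.g. from learned compatibilities, rather than from a fixed expected fraction.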

For the vast majority of vision-related tasks only Robot Vision systems 
can provide pragmatic solutions. The possibility of camera control and selec- 
tive attention should be exploited for resolving ambiguous situations and for 
completing task-relevant information. The successful execution of the visual 
task is critically based on autonomous learning from visual demonstration. 
The online adaptation of visual procedures takes possible deviations between 
learned and actual aspects into account. Learning and adaptation are biased 
under general compatibilities between geometry and photometry of image 
formation, which are assumed to hold for a category of similar tasks and a 
category of similar camera objectives. 



General representation frameworks, reasoning techniques, and imaging
theories are no longer needed; rather, task-related representations,
operators, and calibrations are learned and adapted on demand.



Section 1.3 will demonstrate that these principles of Robot Vision are
consistent with new approaches to designing autonomous robot systems.

1.3 Robot Systems versus Autonomous Robot Systems



Robots work in environments which are more or less customized to the di- 
mension and the needs of the robot. 



1.3.1 Characterization of a Robot System 

Definition 1.3 (Robot System) A robot system is a mechanical device
which can be programmed to move in the environment and handle objects or
tools. The hardware consists essentially of an actuator system and a
computer system. The actuator system is the mobile and/or agile body which
consists of the effector component (exterior of the robot body) and the
drive component (interior of the robot body). The effectors physically
interact with the environment by steering the motors of the drive. Examples
of effectors are the wheels of a mobile robot (robot vehicle) or the
gripper of a manipulation robot (manipulator, robot arm). The computer
system is composed of general and/or special purpose processors, several
kinds of storage, etc., together with a power unit. The software consists of
an interpreter for transforming high-level language constructs into an
executable form and procedures for solving the inverse kinematics and
sending steering signals to the drive system.

Advanced robot systems are under development which will be equipped 
with a sensor or camera system for perceiving the environmental scene. Based 
on perception, the sensor or camera system must impart to the robot an 
impression of the situation wherein it is working, and thus the robot can 
take appropriate actions for more flexibly solving a task. The usefulness of
the human visual system motivates the development of robots equipped with
video cameras. The video cameras of an advanced robot may or may not be a
part of the actuator system.

In camera-equipped systems the robots can be used for two alternative
purposes, leading to a robot-supported vision system (robot-for-vision
tasks) or to a vision-supported robot system (vision-for-robot tasks). In
the first case, a purposive camera control is the primary goal. For the
inspection of objects, factories, or processes, the cameras must be agile
for taking appropriate images. A separate actuator system, i.e. a so-called
robot head, is responsible for the control of external and/or internal
camera parameters. In the second case, cameras are fastened on a stable
tripod (e.g. eye-off-hand system) or fastened on an actuator system (e.g.
eye-on-hand system), and the images are a source of information for the
primary goal of executing robot tasks autonomously. For example, a
manipulator may handle a tool on the basis of images taken by an
eye-off-hand or an eye-on-hand system. In both cases, a dynamic relationship
between camera and scene is characteristic, e.g. inspecting situations with
active camera robots, or handling tools with vision-based manipulator
robots. For more complicated applications the cameras must be separately
agile in addition to the manipulator robot, i.e. they need a robot of
their own just for the control of the cameras. For those advanced
arrangements, the distinction between robot-supported vision system and
vision-supported robot system no longer makes sense, as both types are
fused.

The most significant issue in current research on advanced robot systems
is to develop an infrastructure, based on which a robot system can learn
and adapt task-relevant competences autonomously. In the early nineteen
nineties, Brooks made clear in a series of papers that the development of
autonomous robots must be based on completely new principles [26, 27, 28].
Most importantly, autonomous robots cannot emerge by simply combining
results from research on Artificial Intelligence and Computer Vision.
Research in both fields concentrated on reconstructing symbolic models and
reasoning about abstract models, which was quite often irrelevant due to
unrealistic assumptions. Instead, an intelligent system must interface
directly to the real world through perception and action. This challenge
can be handled by considering four basic characteristics that are tightly
connected with each other, i.e. situatedness, corporeality, emergence, and
competence. Autonomous robots must be designed and organized into
task-solving behaviors, taking the four basic characteristics into
account.^



1.3.2 Characterization of an Autonomous Robot System

Situatedness

The autonomous robot system solves the tasks in the total complexity of 
concrete situations of the environmental world. The task-solving process is 
based on situation descriptions, which must be acquired continually using 
sensors and/or cameras. Proprioceptive and exteroceptive features of a sit- 
uation description are established, which must be adequate and relevant for 
solving the specific robot task at hand. Proprioceptive features describe the 
internal state of the robot, e.g. the coordinates of the tool center point, which 
can be changed by the inherent degrees of freedom. Exteroceptive features 
describe aspects in the environmental world and, especially, the relationship 
between robot and environment, e.g. the distance between robot hand and 
target object. The characteristic of a specific robot task is directly correlated 
with a certain type of situation description. For example, for robotic object 
grasping the exteroceptive features describe the geometric relation between 
the shape of the object and the shape of the grasping fingers. However, for 
robotic object inspection another type of situation description is relevant, e.g. 
the silhouette contour of the object. Based on the appropriate type of situ- 
ation description, the autonomous robot system must continually interpret 
and evaluate the concrete situations correctly. 

^ In contrast to Brooks [27], we prefer the term corporeality instead of
embodiment and competence instead of intelligence; both replacements seem
to be more appropriate (see also Sommer [161]).

Corporeality

The camera-equipped robot system experiences the world under physical con- 
straints of the robot and optical constraints of the camera. These robot and 
camera characteristics affect the task-solving process crucially. Robots and 
cameras are themselves part of the scene, and therefore in a situation de- 
scription the proprioceptive features are correlated with the exteroceptive 
features. For example, if a camera is fastened on a robot hand with the opti- 
cal axis pointing directly through parallel jaw fingers, then the closing of the 
fingers is reflected both in the proprioceptive and exteroceptive part of the 
situation description. The specific type of the camera system must be chosen 
to be favourable for solving the relevant kind of tasks. The camera system 
can be static, or fastened on a manipulator, or separate and agile using a 
robot head, or various combinations of these arrangements. The perception 
of the environmental world is determined by the type of camera objectives, 
e.g. the focal length influences the depicted size of an object, the field of 
view, and possible image distortions. Therefore, useful classes of 2D situa- 
tions in the images can only be acquired based on actually experienced 3D 
situations. The specific characteristic of robot and camera must be taken into 
account directly without abstract modeling. A purposive robotic behavior is 
characterized by situatedness and is biased by the corporeality of the robotic 
equipment. 

Competence 

An autonomous robot system must show competent behaviors when working 
on tasks in the real environment. Both expected and unforeseen environmen- 
tal situations may occur upon which the task-solving process has to react ap- 
propriately. The source of competence originates in signal transformations, 
mainly including feature extraction, evaluation of situations and construc- 
tion of mappings between situations and actions. Environmental situations 
may represent views of scene objects, relations between scene objects, or re- 
lations between robot body and scene objects. The specific type of situation 
description is determined under the criterion of task-relevance including a 
minimalism principle. That is, from the images only the minimum amount 
of quantitative or qualitative information should be extracted which is abso- 
lutely necessary for solving the specific task. In order to reach sufficient levels 
of competence, the procedures for minimalistic feature extraction and situ- 
ation evaluation can hardly be programmed, but have to be learned on the 
basis of demonstrations in an offline phase. Additionally, purposive
situation-action pairs have to be learned which serve as ingredients of
visual servoing mechanisms. A servoing procedure must capture the actual
dynamics of the relationship between effector movements and changing
environmental situations, i.e. the actual dynamics between proprioceptive
and exteroceptive features. The purpose is to reach a state of equilibrium
between robot and environment, without having exact models of environment
and camera.

Emergence

The competences of an autonomous robot system must emerge from the sys- 
tem’s interactions with the world. The learned constructs and servoing mech- 
anisms can be regarded as backbone for enabling these interactions. In the 
online phase the learned constructs work well for expected situations leading 
to basic competences in known environments. However, degradations occur 
in environments of deviating or unforeseen situations. The confrontation with 
actual situations causes the system to revisit and maybe adapt certain ingre- 
dients of the system on demand. Consequently, more robust competences are 
being developed by further training and refining the system on the job.
Furthermore, competences at higher levels of complexity can be developed.
The repeated application of servoing cycles gives rise to grouping
perception-action pairs more compactly, which will lead to macro actions.
Additionally, based on exploration strategies applied in unknown
environments, the system can collect data with the purpose of building a
map and, based on that, constructing more purposive situation-action pairs.
By considering the concepts mentioned before, high level competences will
emerge from low level competences, which in turn are based on learned
constructs and servoing mechanisms. Common sense knowledge about
methodologies of situation recognition or strategies of scene exploration
should be made available to the system. This knowledge reduces the need for
everything to be learned and/or to emerge. However, before taking this
knowledge for granted, its validity must be tested in advance under the
actual conditions.

Behavioral Organization 

Each behavior is based on an activity-producing subsystem, featuring
sensing, processing, and acting capabilities. The organization of behaviors
begins on the bottom level with very simple but complete subsystems, and
follows an incremental path ending at the top level with complex autonomous
systems. In this layered organization, all behaviors have permanent access
to the specific sensing facility and compete in gaining control over the
effector. In order to achieve a reasonable global behavior, a ranking of
importance is considered for all behaviors, and only the most important
ones have a chance to become active. The relevant behavior or subset of
behaviors is triggered by specific sensations in the environment. For
example, the obstacle-avoiding behavior must become active before a
collision in order to guarantee the survival of the robot; otherwise the
original task-solving behavior would be active.
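
The competition of ranked behaviors for the effector can be illustrated by a minimal arbitration sketch in Python. It is an assumption-laden simplification: the behavior names, the trigger conditions, the distance limit of 0.2, and the rule that exactly one behavior wins are all hypothetical and serve only to make the ranking idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Behavior:
    name: str
    priority: int                                    # higher value = more important
    is_triggered: Callable[[Dict[str, float]], bool]
    action: str                                      # effector command, here just a label


def arbitrate(behaviors: List[Behavior], sensations: Dict[str, float]) -> Optional[str]:
    """Give the effector to the most important behavior that is triggered."""
    triggered = [b for b in behaviors if b.is_triggered(sensations)]
    if not triggered:
        return None
    return max(triggered, key=lambda b: b.priority).action


behaviors = [
    Behavior("approach_target", 1, lambda s: True, "move_towards_target"),
    Behavior("avoid_obstacle", 2, lambda s: s["obstacle_distance"] < 0.2, "steer_away"),
]

free_path = arbitrate(behaviors, {"obstacle_distance": 1.5})
near_obstacle = arbitrate(behaviors, {"obstacle_distance": 0.1})
```

With a free path the task-solving behavior keeps control; as soon as the obstacle sensation occurs, the higher-ranked avoidance behavior takes over.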

The characteristics of autonomous robot systems, i.e. situatedness,
corporeality, competence, emergence, and behavioral organization, have been
formulated at an abstract level. For the concrete development of autonomous
camera-equipped robot systems, one must lay the appropriate foundations. We
propose that the characteristics of Robot Vision make up just this
foundation. They have been summarized in Subsection 1.2.2 by animated
attention, purposive perception, visual demonstration, compatible
perception, biased learning, and feedback analysis. Figure 1.1 shows the
characteristics of an autonomous robot system (top level) together with the
characteristics of Robot Vision (bottom level), which together characterize
an autonomous camera-equipped robot system. As learned in the discussion of
alternative purposes of robots (Subsection 1.3.1), an autonomous
camera-equipped robot system can be a robot-supported vision system, a
vision-supported robot system, or a combination of both.




1.3.3 Autonomous Camera-Equipped Robot System 

Definition 1.4 (Autonomous camera-equipped robot system) An autonomous
camera-equipped robot system is a robot system including robot heads and/or
cameras, which shows autonomous task-solving behaviors of visual perception
and/or action. The basis for autonomy is situatedness, corporeality,
competence, and emergence. These characteristics can be reached by
animated attention, purposive perception, visual demonstration, compatible
perception, biased learning, and feedback analysis.

All characteristics of an autonomous camera-equipped robot system are
highly correlated with each other. Any one could be taken as a seed from
which to evolve the others. This becomes obvious in Section 1.4, in which
the role of visual demonstration is explained with regard to learning in
reality.

1.4 Important Role of Demonstration and Learning

In an autonomous camera-equipped robot system, the 3D spatial relation
between effector, camera(s), and/or object(s) must change according to a
task-relevant strategy. The images produced by the camera(s) are the input
to autonomy-relevant functions which are responsible for generating
appropriate control signals for the effectors.

It is obvious that a 3D world situation between objects or between object 
and effector appears differently in the 2D images in the case of varying viewing 
positions [96] or varying lighting conditions [17]. Conversely, different 3D 
world situations can lead to similar 2D images (due to loss of one dimension) 
or dissimilar 2D images. Therefore, classes of image feature values must be 
determined which originate from a certain 3D situation or a certain set of 3D 
situations. In this work, two types of classes will be distinguished, and, related 
to them, the concepts of compatibilities and manifolds will be introduced. 



Compatibilities describe constraints on the process of image forma- 
tion which do hold, more or less, under task-relevant or random vari- 
ations of the imaging conditions. 



The class of image feature values involved in a compatibility is represented 
by a representative value together with typical, small deviations. 



Manifolds describe significant variations of image feature values which 
originate under the change of the 3D spatial relation between effector, 
camera, and/or objects. 



The class of image feature values involved in a manifold is represented 
extensively by basis functions in a canonical system. Visual demonstration is 
the basis for learning compatibilities and manifolds in the real task-relevant 
world. Generally, the learned compatibilities can be used for parameterizing 
prior vision algorithms and manifolds can be used for developing new vision 
algorithms. 

For the vision-based control of the effector, the relationship between en- 
vironment, effector, and image coordinates must be determined. Specifically, 
the purpose is to transform image coordinates of objects into the coordinate 
system of the effector. The autonomy-relevant functions for effector control 
are based on the combined use of feature compatibilities, feature manifolds, 
and environment-effector-image relationships. 

1.4.1 Learning Feature Compatibilities under Real Imaging 

By eliciting fundamental principles underlying the process of image
formation, one can make use of a generic bias, and thus reduce the role of
object-specific knowledge for structure extraction and object recognition in
the image [117, pp. 26-30]. Theoretical assumptions (e.g. projective
invariants) concerning the characteristics of image formation, which can be
proven nicely for simulated pinhole cameras, generally do not hold in
practical applications. Instead, realistic qualitative assumptions
(so-called compatibilities) must be learned in an offline phase prior to
online application.

Compatibility of Regularities under Geometric Projection 

Shape descriptions are most important if they are invariants under
geometric projection and change of view. First, the perceptual organization
of line segments into complex two-dimensional constructs, which originate
from the surface of three-dimensional objects, can be based on invariant
shape regularities. For example, simple constructs of parallel line segment
pairs are used by Ylä-Jääski and Ade [179], or sophisticated constructs of
repeated structures or rotational symmetries by Zisserman et al. [182].
Second, invariant shape regularities are constant descriptions of certain
shape classes and, therefore, can be used as indices for recognition [143].
A real camera, however, executes a projective transformation in which shape
regularities are relaxed in the image, e.g. three-dimensional symmetries
are transformed into two-dimensional skewed symmetries [77]. More
generally, projective quasi-invariants must be considered instead of
projective invariants [20].

By demonstrating sample objects including typical regularities and visu- 
ally perceiving the objects using actual cameras, one can make measurements 
of real deviations from regularities (two-dimensionally projected), and thus 
learn the relevant degree of compatibility. 

Compatibility of Object Surface and Photometric Invariants 

Approaches for recognition and/or tracking of objects in images are con- 
fronted with variations of the gray values, caused by changing illumination 
conditions. The object illumination can change directly with daylight and/or 
the power of light bulbs, or can change indirectly by shadows arising in the 
spatial relation between effector, camera, and object. The problem is to con- 
vert color values or gray values, which depend on the illumination, into de- 
scriptions that do not depend on the illumination. However, solutions for 
perfect color constancy are not available in realistic applications [55], and 
therefore approximate photometric invariants are of interest. For example,
normalizations of the gray value structure by standard or central moments
of second order can improve the reliability of correlation techniques [148].
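
The effect of such a normalization can be made concrete with a small Python sketch. It assumes, purely for illustration, that an illumination change acts on the gray values as a gain and an offset; real illumination changes deviate from this model, which is exactly why only approximate invariants can be expected. The gray values and function names are hypothetical.

```python
import math


def normalize(patch):
    """Subtract the mean and divide by the square root of the
    second-order central moment (the standard deviation)."""
    n = len(patch)
    mean = sum(patch) / n
    variance = sum((g - mean) ** 2 for g in patch) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * n        # uniform patch: no structure to normalize
    return [(g - mean) / std for g in patch]


def correlation(a, b):
    """Mean product of two normalized patches (1.0 for identical structure)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


patch = [10.0, 40.0, 90.0, 120.0, 60.0]
brighter = [1.5 * g + 20.0 for g in patch]      # assumed illumination change

score = correlation(normalize(patch), normalize(brighter))
```

Under the assumed gain/offset model the correlation of the normalized patches stays at 1 up to rounding, whereas the raw gray values differ considerably.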

By demonstrating sample objects under typical changes of the illumination,
one can make measurements of real deviations from exact photometric
invariants, and thus learn the relevant degree of compatibility.

Compatibility of Geometric and Photometric Image Features

The general assumption behind all approaches of object detection and
boundary extraction is that three-dimensional surface discontinuities should
have corresponding gray value edges in the image.^ Based on this, a
compatibility between the geometric and photometric type of object
representation must hold in the image. For example, the orientation of an
object boundary line in the image must be similar to the orientation of the
gray value edge at a point on the line [135]. As a further example, the
junction angle of two boundary lines must be similar to the opening angle of
the gray value corner at the intersection point of the lines. The geometric
line features are computed globally in an extended patch of the image, and
the photometric edge or corner features are computed locally in a small
environment of a point. Consequently, by the common consideration of
geometric and photometric features one also verifies the compatibility
between global and local image structure.

By demonstrating sample objects including typical edge curvatures and 
extracting geometric and photometric image features, one can compare the 
real measurements and learn the relevant degree of compatibility. 
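
The comparison of a global line orientation with a local gray value edge orientation can be sketched as follows. The sketch assumes that the edge orientation is obtained from a local gray value gradient (the edge runs perpendicular to the gradient) and that orientations are undirected, i.e. defined modulo 180 degrees. The coordinates and gradient values are hypothetical measurements.

```python
import math


def line_orientation(p, q):
    """Orientation (degrees, in [0, 180)) of the line through points p and q."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def edge_orientation(gx, gy):
    """Orientation (degrees, in [0, 180)) of a gray value edge whose
    gradient is (gx, gy); the edge runs perpendicular to the gradient."""
    return (math.degrees(math.atan2(gy, gx)) + 90.0) % 180.0


def orientation_deviation(a, b):
    """Smallest angle between two undirected orientations, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


# Hypothetical measurements: a boundary line (global, geometric) and the
# gray value gradient at one point on that line (local, photometric).
line = line_orientation((10.0, 10.0), (110.0, 60.0))
edge = edge_orientation(gx=-0.45, gy=1.0)
deviation = orientation_deviation(line, edge)
```

Collecting such deviations over many demonstrated objects yields the empirical distribution from which a degree of compatibility could be learned.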

Compatibility of Motion Invariants and Changes in View Sequence 

In an autonomous camera-equipped robot system, the spatial relation
between camera(s) and object(s) changes continually. The task-solving process
could be represented by a discrete series of changes in this spatial
relation, e.g. one could consider the changing relations for the task of
moving the robot hand of a manipulator towards a target object while
avoiding obstacle objects. Usually, there are different possible
trajectories subject to the constraint of solving the task. A cost function
must be used for determining the cheapest course. Besides the typical
components of the cost function, i.e. distance to goals and obstacles, it
must also include a measure of the difficulty of extracting and tracking
task-relevant image features. This aspect is directly related to the
computational effort of image sequence analysis and, therefore, influences
the real-time capability of an autonomous robot system. By constraining the
possible camera movements appropriately,
the flow vector fields originating from scene objects are easy to represent.
For example, a straight camera movement parallel over a plane face of a
three-dimensional object should reveal a uniform flow field at the face
edges. As a further example, if a camera is approaching an object or is
rotating around the optical axis, which is normal to the object surface,
then the log-polar transformation (LPT) can be applied to the gray value
images. The motivation lies in the fact that during the camera movement,
simple shifts of the transformed object pattern occur without any pattern
distortions [23]. However, in the view sequence these invariants only hold
for a simulated pinhole camera whose optical axis must be kept accurately
normal to the object surface while moving the camera.

^ See the book of Klette et al. [90] for geometric and photometric aspects
in Computer Vision.

By demonstrating sample objects and executing typical camera move- 
ments relative to the objects, one can make measurements of real deviations 
from uniformity of the flow field in original gray value or in transformed im- 
ages, and thus learn the relevant degree of compatibility between 3D motions 
and 2D view sequences. 
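
Why the log-polar transformation turns an approaching or rotating camera movement into simple pattern shifts can be checked numerically. The sketch below assumes the idealized case only: image coordinates are taken relative to the point where the optical axis meets the image, approaching the object acts as a pure scaling, and rotation around the optical axis as a pure image rotation. The point coordinates, the scale factor, and the angle are hypothetical.

```python
import math


def log_polar(x, y):
    """Map Cartesian image coordinates (relative to the image center,
    assumed to lie on the optical axis) to (log radius, angle)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def scale_and_rotate(x, y, scale, angle):
    """Ideal effect of approaching the object (scale) and of rotating
    the camera around the optical axis (angle) on an image point."""
    c, s = math.cos(angle), math.sin(angle)
    return scale * (c * x - s * y), scale * (s * x + c * y)


points = [(3.0, 4.0), (-2.0, 5.0), (6.0, -1.0)]
scale, angle = 1.8, math.radians(25.0)

shifts = []
for x, y in points:
    rho0, theta0 = log_polar(x, y)
    rho1, theta1 = log_polar(*scale_and_rotate(x, y, scale, angle))
    # wrap the angular difference into [-pi, pi)
    d_theta = (theta1 - theta0 + math.pi) % (2 * math.pi) - math.pi
    shifts.append((rho1 - rho0, d_theta))
```

Every point is shifted by the same amount, namely log(scale) along the radial axis and the rotation angle along the angular axis. With a real camera the shifts will differ from point to point, and the spread of these differences is what a demonstration would measure.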

Invariants Are Special Cases of Compatibilities 

In classical approaches of Computer Vision, invariants are constructed for a
group of transformations, e.g. by eliminating the transformation parameters
[111]. In real applications, however, the actual transformation formula is
not known, and for solving a certain robot task only a relevant subset of
transformations should be considered (possibly lacking characteristics of a
group). The purpose of visual demonstration is to consider the real
corporeality of robot and camera by learning realistic compatibilities
(involved in the imaging process) instead of assuming non-realistic
invariants. Mathematically, a compatibility must be attributed with a
statistical probability distribution, which represents the probabilities
that certain degrees of deviation from a theoretical invariant might occur
in reality. In this work, Gaussian probability distributions are
considered, and based on that, the Gaussian extent value σ can be used to
define a confidence value for the adequacy of a theoretical invariant. The
lower the value of σ, the more confidence can be placed in the theoretical
invariant, i.e. the special case of a compatibility with σ equal to 0
characterizes a theoretical invariant. In an experimentation phase, the σ
values of interesting compatibilities are determined by visual
demonstration and learning, and in the successive application phase, the
learned compatibilities are considered in various autonomy-relevant
functions. This methodology of acquiring and using compatibilities replaces
the classical concept of non-realistic, theoretical invariants.
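
How a Gaussian extent value might be obtained from demonstrations and then used in the application phase is sketched below. Two choices are assumptions of this sketch and not prescribed by the text: σ is estimated as the root mean square of the measured deviations from the value predicted by the theoretical invariant (zero deviation), and a measurement is accepted as compatible if it lies within three times σ. The deviation values are hypothetical.

```python
import math


def gaussian_extent(deviations):
    """Estimate sigma as the root mean square deviation from the value
    predicted by the theoretical invariant (i.e. from zero deviation)."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


def is_compatible(deviation, sigma, factor=3.0):
    """Accept a measurement if it lies within `factor` times sigma.
    The 3-sigma rule is an assumed acceptance criterion."""
    return abs(deviation) <= factor * sigma


# Hypothetical deviations (degrees) from exact parallelism of line pairs,
# measured while demonstrating sample objects to the real camera.
measured = [0.8, -1.2, 0.3, -0.5, 1.6, -0.9, 0.4, -0.2]
sigma = gaussian_extent(measured)

small = is_compatible(2.0, sigma)       # within the learned extent
far_off = is_compatible(9.0, sigma)     # far outside the learned extent
```

If all measured deviations were zero, the estimate would yield σ = 0, which corresponds to the special case of a theoretical invariant.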

The first attempt at relaxing invariants was undertaken by Binford and
Levitt, who introduced the concept of quasi-invariance under
transformations of geometric features [20]. The compatibility concept in
our work is a more general one, because more general transformations can be
considered, possibly with different types of features before and after the
mapping.

1.4.2 Learning Feature Manifolds of Real World Situations

For the detection of situations in an image, i.e. in answer to the question
“Where is which situation?” [12], one must acquire models of target
situations in advance. There are two alternatives for acquiring such model
descriptions. In the first approach, detailed models of 3D target
situations and projection functions of the cameras are requested from the
user, and from that the relevant models of 2D target situations are
computed [83]. In many real world applications, however, the gap between 2D
image and 3D world situations is problematic, i.e. it is difficult, costly,
and perhaps even impossible to obtain realistic 3D models and realistic
projection functions.® In the
second approach, descriptions of 2D target situations are acquired directly 
from image features based on visual demonstration of 3D target situations 
and learning of feature manifolds under varying conditions [112]. For many 
tasks to be carried out in typical scenes, this second approach is preferable, 
because actual objects and actual characteristics of the cameras are consid- 
ered directly to model the 2D target situations. A detection function must 
localize meaningful patterns in the image and classify or evaluate the features 
as certain model situations. The number of task-relevant image patterns is 
small in proportion to the overwhelming number of all possible patterns [136], 
and therefore a detection function must represent the manifolds of relevant 
image features implicitly. 

In the following we use the term feature in a general sense. An image 
pattern or a collection of elementary features extracted from a pattern will 
simply be called a feature. What we really mean by a feature is a vector 
or even a complicated structure of elementary (scalar) features. This simpli- 
fication enables easy reading, but where necessary we present the concrete 
specification. 

Learning Feature Manifolds of Classified Situations 

The classification of a feature means assigning it to that model situation
whose feature manifold contains the relevant feature most appropriately,
e.g. recognizing a feature in the image as a certain object. Two criteria
should be considered simultaneously, robustness and efficiency, and a
measure is needed for both criteria in order to judge different feature
classifiers. For the robustness criterion, a measure can be adopted from the
literature on statistical learning theory [170, 56] by considering the
definition of probably approximately correct learning (PAC-learning). A set
of model situations is said to be PAC-learned if, with a probability of at
least Pr, a maximum percentage E of features is classified erroneously.
Robustness can be defined reasonably by the quotient of Pr by E, i.e. the
higher this quotient, the more robust is the classifier. It is conceivable
that high robustness requires an extensive amount of attributes for
describing the classifier. However, in order to reduce the computational
effort of classifying features, a minimum description length of the
classifier is preferred [139]. For the obvious conflict between robustness
and efficiency a compromise is needed.

® Recent approaches of this kind use more general, parametric models which
express certain unknown variabilities, and these are verified and
fine-tuned under the actual situations in the images [97]. With regard to
the qualitativeness of the models, these new approaches are similar to the
concept of compatibilities in our work, as discussed above.

By demonstrating appearance patterns of classified situations, one can
experimentally learn several versions of classifiers and finally select the
one which achieves the best compromise between robustness and efficiency.
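
The selection among several learned classifier versions can be sketched in Python. The sketch rests on assumptions that go beyond the text: the probability bound of the PAC criterion (written Pr here) is estimated as the fraction of test runs whose error rate stays below the maximum percentage E, the description length is simply a number of attributes, and the compromise is resolved by taking the shortest classifier that reaches a required robustness. The candidate data are hypothetical.

```python
def robustness(error_rates, max_error):
    """Quotient Pr / E, with Pr estimated as the fraction of test runs
    whose error rate does not exceed E = max_error."""
    pr = sum(1 for e in error_rates if e <= max_error) / len(error_rates)
    return pr / max_error


def select_classifier(candidates, max_error, min_robustness):
    """Among sufficiently robust candidates, prefer the shortest description."""
    robust_enough = [
        c for c in candidates
        if robustness(c["error_rates"], max_error) >= min_robustness
    ]
    if not robust_enough:
        return None
    return min(robust_enough, key=lambda c: c["description_length"])["name"]


# Hypothetical classifier versions: error rates from five test runs each and
# the number of attributes needed to describe the classifier.
candidates = [
    {"name": "coarse", "description_length": 10,
     "error_rates": [0.12, 0.09, 0.15, 0.11, 0.08]},
    {"name": "medium", "description_length": 40,
     "error_rates": [0.04, 0.06, 0.03, 0.05, 0.12]},
    {"name": "fine", "description_length": 200,
     "error_rates": [0.02, 0.03, 0.01, 0.04, 0.02]},
]

chosen = select_classifier(candidates, max_error=0.10, min_robustness=7.0)
```

Here the coarse version is too unreliable, and the fine version is robust but needs a long description, so the medium version is selected. Other ways of weighing the two criteria are equally conceivable.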

Learning Feature Manifolds of Scored Situations 

Task-relevant changes of 3D spatial relations between effector, camera(s), 
and/or object(s) must be controlled by assessing for the stream of images the 
successive 2D situations relative to the 2D goal situation. The intermediate 
situations are considered as discrete steps in a course of scored situations 
up to the main goal situation. Classified situations (see above) are a special 
case of scored situations with just two possible scores, e.g. values 0 or 1. In 
the continual process of robot servoing, e.g. for arranging, grasping, or view- 
ing objects, the differences between successive 2D situations in the images 
must correlate with certain changes between successive 3D spatial relations. 
Geometry-related features in the images include histograms of edge orienta- 
tion [148], results of line Hough transformation [123], responses of situation- 
specific Gabor filters [127], etc. Feature manifolds must characterize scored 
situations, e.g. the gripper is 30 percent off from the optimal grasping situa- 
tion. Both for learning and applying these feature manifolds the coherence of 
situations can be taken into account. A course of scored situations is said to 
be PAC-learned if, with a probability of at least P’’, the deviation from
the actual score is at most D.

By demonstrating appearance patterns of scored situations, one can learn 
several versions of scoring modules and finally select the best one. Experi- 
ments with this scoring module are relevant for determining the degree of 
correlation between scored situations in the images and certain 3D spatial 
relations. 

Systems of Computer Vision include the facility of geometric modeling,
e.g. by operating with a CAD subsystem (computer-aided design). The purpose
is to incorporate geometric models, which are needed for the recognition
of situations in the images. However, in many realistic applications one is 
asking too much of the system user if requested to construct the models for 
certain target situations. By off-line visual demonstration of situations under 
varying, task-relevant conditions, the model situations can be acquired and 
represented as manifolds of image features directly. In this work, an approach 
for learning manifolds is presented which combines Gaussian basis function 
networks [21, pp. 164-193] and principal component analysis [21, pp. 310-319]. 

The novelty of the approach is that the coherence of certain situations will
be used for constraining, both locally and globally, the complexity of the
appearance manifold. This improves the robustness of both recognition and
scoring functions.
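
To make the idea of a scoring function over demonstrated situations more
concrete, the following sketch interpolates scores with normalized Gaussian
basis functions. It is a simplification under stated assumptions: the
principal component analysis step is omitted, one basis function is placed on
every demonstrated feature vector, and the feature vectors, scores, and the
width `sigma` are invented.

```python
import math

def gaussian(x, center, sigma):
    # Isotropic Gaussian basis function centered on a demonstrated pattern.
    d2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def score(x, centers, scores, sigma=0.5):
    # Normalized weighted sum of the demonstrated scores.
    weights = [gaussian(x, c, sigma) for c in centers]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("feature vector too far from all demonstrations")
    return sum(w * s for w, s in zip(weights, scores)) / total

# Hypothetical 2D feature vectors of three demonstrated situations and
# their scores (0 = far from the goal situation, 1 = goal situation).
centers = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
scores = [0.0, 0.5, 1.0]

print(score((1.5, 0.75), centers, scores))
```

A classified situation is the special case in which the demonstrated scores
take only the values 0 or 1.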

1.4.3 Learning Environment-Effector-Image Relationships

The effector interacts with a small environmental part of the world. For ma- 
nipulating or inspecting objects in this environment, their coordinates must 
be determined relative to the effector. The relationship between coordinates
in the image coordinate system and the effector coordinate system must be 
formalized [173, 78]. The relevant function can be learned automatically by 
controlled effector movement and observation of a calibration object. 

Learning Relationships for an Eye-off-Hand System 

For an eye-off-hand system, the gripper of a manipulator can be used as 
calibration object which is observed by cameras without physical connection 
to the robot arm. The gripper is steered by a robot program through the
working space, and its changing image and manipulator coordinates are
used as samples for learning the relevant function.

Learning Relationships for an Eye-on-Hand System 

For an eye-on-hand system the camera(s) is (are) fastened on the actuator 
system for controlling inspection or manipulation tasks. A natural or artifi- 
cial object in the environment of the actuator system serves as calibration 
object. First, the effector is steered by the operator (manually using the con- 
trol panel) into a certain relation to this calibration object, e.g. touching it or 
keeping a certain distance to it. In this way, the goal relation between effector 
and an object is stipulated, something which must be known in the applica- 
tion phase of the task-solving process. Specifically, a certain environmental 
point will be represented more or less accurately in actuator coordinates. Sec- 
ond, the effector is steered by a robot program through the working space, 
and the changing image coordinates of the calibration object and position 
coordinates of the effector are used as samples for learning the relevant func- 
tion. 

These strategies of learning environment-effector-image relationships are 
advantageous in several respects. First, by controlled effector movement, the
relevant function of coordinate transformation is learned directly, without
computing the intrinsic camera parameters and without introducing artificial
coordinate systems (e.g. an external world coordinate system). Second, the
density of training samples can easily be changed by different discretizations
of effector movements. Third, a natural object can be used instead of an
artificial calibration pattern. Fourth, task-relevant goal relations are
demonstrated instead of modeling them artificially.

The learned function is used for transforming image coordinates of objects 
into the coordinate system of the actuator system. 
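
As an illustration of learning such a transformation from samples, the
following sketch fits an affine map from image coordinates (u, v) to planar
effector coordinates (x, y) by least squares. The affine model is an
assumption chosen for brevity (it presumes a planar working area and
negligible perspective distortion); the text does not prescribe it, and the
samples here are synthetic stand-ins for demonstrated gripper positions.

```python
def solve_linear(a, b):
    # Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples are degenerate (e.g. collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_affine(image_pts, effector_pts):
    # Least-squares fit via the normal equations, one row of parameters
    # (offset, u-coefficient, v-coefficient) per effector coordinate.
    rows = [(1.0, u, v) for u, v in image_pts]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * p[k] for r, p in zip(rows, effector_pts))
               for i in range(3)]
        params.append(solve_linear(ata, atb))
    return params

def apply_affine(params, u, v):
    return tuple(p[0] + p[1] * u + p[2] * v for p in params)

# Synthetic training samples on a 3 x 3 grid of image positions.
image_pts = [(u, v) for u in (0.0, 100.0, 200.0) for v in (0.0, 100.0, 200.0)]
effector_pts = [(100.0 + 0.5 * u - 0.1 * v, -50.0 + 0.2 * u + 0.4 * v)
                for u, v in image_pts]

params = fit_affine(image_pts, effector_pts)
x, y = apply_affine(params, 50.0, 150.0)
print(x, y)
```

A denser grid of effector positions simply yields more rows in the fit, which
corresponds to changing the discretization of effector movements.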

1.4.4 Compatibilities, Manifolds, and Relationships 

A 3D spatial relation between effector and objects appears differently in the 
image depending on viewing and lighting conditions. Two concepts have been 
mentioned for describing the ensemble of appearance patterns, i.e. compati- 
bilities and manifolds. 

Extracting Reliable and Discriminative Image Features

By uncovering compatibilities under real image formation, one determines 
parameters and constraints for prior vision algorithms. This method par- 
tially regularizes the ill-posed problem of feature extraction [130]. A principal 
purpose of learning and applying compatibilities is to obtain more reliable 
features from the images without taking specific object models into account. 
Furthermore, certain compatibilities can lead to the extraction of image fea- 
tures which are nearly constant for the ensemble of appearance patterns of 
the 3D spatial relation, i.e. the features are a common characterization of
various appearances of the 3D spatial relation. However, the extracted im- 
age features may not be discriminative versus appearance patterns which 
originate from other 3D spatial relations. 

Specifically, this aspect plays a significant role in the concept of mani- 
folds. Appearance patterns from different 3D spatial relations must be dis- 
tinguished, and the various patterns of an individual 3D relation should be 
collected. Manifolds of image features are the basis for a robust and efficient 
discrimination between different 3D situations, i.e. by constructing recognition
or scoring functions. The features considered for the manifolds have an
influence on the robustness, and the complexity of the manifolds has an in- 
fluence on the efficiency of the recognition or scoring functions. Therefore, 
the results from learning certain compatibilities are advantageous for shaping 
feature manifolds. For example, by applying gray value normalization in a rel- 
evant window (compatibility of object surface and photometric invariants), 
one reduces the influence of lighting variations in the appearance patterns, 
and, consequently, the complexity of the pattern manifold is reduced.
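
The effect of gray value normalization can be sketched as follows. The
example assumes that a lighting change acts on the window as a gain and an
offset on the gray values, which is a simplifying model and not a claim of
the text; the window values are invented.

```python
import math

def normalize_window(window):
    # Shift to zero mean and scale to unit standard deviation.
    values = [g for row in window for g in row]
    mean = sum(values) / len(values)
    var = sum((g - mean) ** 2 for g in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        # A constant window carries no pattern; return all zeros.
        return [[0.0 for _ in row] for row in window]
    return [[(g - mean) / std for g in row] for row in window]

window = [[10.0, 20.0], [30.0, 40.0]]
# Same pattern under a hypothetical lighting change (gain 2, offset 15).
brighter = [[2.0 * g + 15.0 for g in row] for row in window]

a = normalize_window(window)
b = normalize_window(brighter)
print(a)
print(b)
```

Both windows map to the same normalized pattern, so this lighting variation
no longer contributes to the extent of the manifold.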

Role of Active Camera Control 

Both the complexity of manifolds and the validity of compatibilities are af- 
fected by constraints in spatial relations between effector, camera(s), and/or 
object(s). A first category of constraints is fixed by the corporeality of the 
camera-equipped robot system, i.e. the effectors, mechanics, kinematics,
objectives, etc. A second category of constraints is flexible according to the
needs of solving a certain task. Two examples of constraints are mentioned 
regarding the relationship of the camera in the environment. First, camera 
alignment is useful for reducing the complexity of manifolds. By putting the 
camera of an eye-off-hand system in a standard position and/or orientation, 
one reduces the variety of different camera-object relations. In this case, the 
complexity of the appearance manifold depends only on position and/or il- 
lumination variations of the objects in the working environment. Second, 
controlled camera motion is useful in fulfilling compatibilities. For a detailed 
object inspection, the camera should approach closely. If the optical axis is 
kept normal to the object surface while moving the camera and applying 
log-polar transformation to the gray value images, then the resulting ob- 
ject pattern is merely shifting but not increasing. These two examples make 
clear that the concepts of manifolds and compatibilities are affected by the
animated vision principle. Cameras should be aligned appropriately for
reducing the complexity of manifolds, and desired compatibilities can be taken
as constraints for controlling the camera movement. 
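
The log-polar property mentioned above can be verified with a few lines. The
sketch assumes that approaching the object along the optical axis scales the
image pattern about the image center; the points and the scale factor are
invented.

```python
import math

def to_log_polar(x, y, cx=0.0, cy=0.0):
    # Map an image point to (log radius, angle) relative to the center.
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        raise ValueError("the center has no log-polar image")
    return math.log(r), math.atan2(dy, dx)

points = [(3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
scale = 2.0  # hypothetical magnification when the camera approaches

for x, y in points:
    rho0, theta0 = to_log_polar(x, y)
    rho1, theta1 = to_log_polar(scale * x, scale * y)
    # The angle is unchanged; the log radius shifts by log(scale).
    print(rho1 - rho0, theta1 - theta0)
```

Every point is shifted by the same amount log(scale) along the log-radius
axis, i.e. the pattern is translated but keeps its size.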

Compatibilities and Manifolds in Robot Vision 

The two concepts of compatibilities and manifolds are essential to clarify the 
distinction between Computer Vision and Robot Vision. The concept of in- 
variance is replaced by the more general concept of compatibility, and the 
concept of geometric models is subordinated to feature manifolds. Both the 
compatibilities and the manifolds must be learned by (visual) demonstra- 
tions in the task-relevant environment. Actually, these principles contribute 
substantially to the main issues arising in Computer Vision (see Section 1.2), 
i.e. origin of models and strategy of model application. In recent years,
several workshops dedicated to performance characteristics and quality of
vision algorithms have been organized.® In the paradigm of Robot Vision, the
task-relevant environment is both the origin and the justification of the
vision algorithms, and therefore it is possible to assess their actual quality.



1.5 Chapter Overview of the Work 

Chapter 2 presents a novel approach for localizing a three-dimensional tar- 
get object in the image and extracting the two-dimensional polyhedral de- 
piction of the boundary. Polyhedral descriptions of objects are needed in 
many tasks of robotic object manipulation, e.g. grasping and assembling 
tasks. By eliciting the general principles underlying the process of image 
formation, we extensively make use of general, qualitative assumptions, and 
thus reduce the role of object-specific knowledge for boundary extraction. 
Geometric/photometric compatibility principles are involved in an approach 
for extracting line segments which is based on Hough transformation. The 
perceptual organization of line segments into polygons or arrangements of 
polygons, which originate from the silhouette or the shape of approximate 
polyhedral objects, is based on shape regularities and compatibilities of pro- 
jective transformation. An affiliated saliency measure combines evaluations 
of geometric/photometric compatible features with geometric grouping fea- 
tures. An ordered set of most salient polygons or arrangements is the basis for 
locally applying techniques of object recognition (see Chapter 3) or detailed 
boundary extraction. The generic approach is demonstrated for technical ob- 
jects of electrical scrap located in real-world cluttered scenes. 

® Interesting papers also have been collected in a special issue of the journal
Machine Vision and Applications [37].

Chapter 3 presents a novel approach for object or situation recognition,
which does not require a priori knowledge of three-dimensional geometric
shapes. Instead, the task-relevant knowledge about objects or situations is
grounded in photometric appearance directly. For the task of recognition, 
the appropriate appearance manifolds must be learned on the basis of visual 
demonstration. The feature variation in a manifold is specified algebraically 
by an implicit function for which certain deviations from ideal value 0 are 
accepted. Functions like these will serve as operators for the recognition of 
objects under varying view angle, view distance, illumination, or background, 
and also serve as operators for the recognition of scored situations. Mixtures 
of Gaussian basis function networks are used in combination with Karhunen- 
Loeve expansion for function approximation. Several architectures will be 
considered under the trade-off between efficiency, invariance, and
discriminability of the recognition function. The versions take account of correlations
between close consecutive view patterns (local in the manifold) and/or for 
the relationship between far distant view patterns (global in the manifold). 
The greatest strength of our approach to object recognition is the ability to 
learn compatibilities between various views under real world changes. Ad- 
ditionally, the compatibilities will have discriminative power versus counter 
objects. 

Chapter 4 uses the dynamical systems theory as a common framework for 
the design and application of camera-equipped robots. The matter of dynam- 
ics is the changing relation between robot effectors and environmental objects. 
We present a modularized dynamical system which enables a seamless transition
between the designing and application phases and a uniform integration of
reactive and deliberate robot competences. The designing phase makes use of 
bottom-up methodologies which need systematic experiments in the environ- 
ment and apply learning and planning mechanisms. Task-relevant deliberate 
strategies and parameterized control procedures are obtained and used in the 
successive application phase. This online phase applies visual feedback mech- 
anisms which are the foundation of vision-based robot competences. A layered 
configuration of dynamic vector fields uniformly represents the task-relevant 
deliberate strategies, determined in the designing phase, and the
perception-action cycles, occurring in the application phase. From the algorithmic point
of view the outcome of the designing phase is a configuration of concrete 
modules which is expected to solve the underlying task. For this purpose, we 
present three categories of generic modules, i.e. instructional, behavioral, and 
monitoring modules, which serve as design abstractions and must be imple- 
mented for the specific task. The resulting behavior in the application phase 
should meet requirements like task-relevance, robustness, flexibility, time
limitation, etc. simultaneously. Using multi-component robotic equipment,
the system design and application are demonstrated for an exemplary high-level task,
which includes sub-tasks of localizing, approaching, grasping, and carrying 
objects. 

Chapter 5 summarizes the work and discusses future aspects of behavior- 
based robotics. 




2. Compatibilities for Object Boundary Detection



This chapter presents a generic approach for object localization and boundary 
extraction which is based on the extensive use of feature compatibilities. 



2.1 Introduction to the Chapter 

The introductory section of this chapter embeds our methodology of object 
localization and boundary extraction in the general context of purposive, 
qualitative vision, then presents a detailed review of relevant literature, and 
finally gives an outline of the following sections.^ 

2.1.1 General Context of the Chapter 

William of Occam (ca. 1285-1349) was somewhat of a minimalist in medieval
philosophy. His motto, known as Occam’s Razor, reads as follows: “It’s vain
to do with more what can be done with less.” This economy principle
(opportunism principle) is self-evident in the paradigm of Purposive Vision [2].



1. The vision system should gather from the images only the infor- 
mation relevant for solving a specific task. 

2. The vision procedures should be generally applicable to a cate- 
gory of similar tasks instead of a single specific task. 



Ad 1. In the field of vision-supported robotics, an example for a category 
of similar tasks is the robot grasping of technical objects. In most cases a 
polyhedral approximation of the target object is sufficient (see the survey of 
robot grasp synthesis algorithms in [156]). For example, in order to grasp 
an object using a parallel jaw gripper, it is sufficient to reconstruct from the 
image a rectangular solid, although the object may have round corners or local 
protrusions. The corporeal form of the robot gripper affects the relevant type 
of shape approximation of the object, i.e. purposive, qualitative description 
of the geometric relation between gripper and object.

^ The extraction of solely purposive information is a fundamental design principle 
of Robot Vision, see Chapter 1. 



J. Pauli: Learning-Based Robot Vision, LNCS 2048, pp. 25-99, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 
For the qualitative reconstruction of these shapes, a certain limited
spectrum of image analysis tools is useful which additionally depends on the
characteristic of the camera system. For example, if a lens with large focal 
length is applied, then it is plausible to approximate the geometric aspect 
of image formation by perspective collineations [53]. As straight 3D object 
boundary lines are projected into approximate straight image lines, one can 
use techniques for straight line extraction, e.g. Hough transformation [100]. 
Altogether, the kind of task determines the degree of qualitativeness of the 
information that must be recovered from the images.^ A restriction to partial 
recovery is inevitable for solving a robot task in limited time with minimum 
effort [9]. In this sense, the designing of autonomous robot systems drives the
designing of included vision procedures (see Chapters 1 and 4, and the work 
of Sommer [161]). 

Ad 2. Applying Occam’s Razor to the designing of vision procedures also 
means a search for applicability for a category of similar tasks instead of a 
single specific task. This should be reached by exploiting ground truths con- 
cerning the situatedness and the corporeality of the camera-equipped robot 
system. The ground truths are constraints on space-time, the camera system, 
and their relationship, which can be generally assumed for the relevant cate- 
gory of robot tasks. General assumptions of various types have been applied 
more or less successfully in many areas of image processing and Computer 
Vision. 

a) Profiles of step, ramp, or sigmoid functions are used as mathematical mod- 
els in procedures for edge detection [72]. 

b) For the perceptual organization of edges into structures of higher
complexity (e.g. line segments, curve segments, ellipses, polygons), approaches of edge
linking are applied which rely on Gestalt principles of proximity, similarity, 
closure, and continuation [103]. 

c) Object recognition is most often based on geometric quantities which are 
assumed to be invariant under the projective transformations used to model 
the process of image formation [110]. Usually, a stratification into euclidean, 
similarity, affine, and projective transformations is considered with the prop- 
erty that in this succession each group of transformations is contained in the 
next. The sets of invariants of these transformation groups are also
organized by subset inclusion but in reverse order, e.g. the cross-ratio is
invariant under both projective and euclidean transformations, whereas length
is invariant only under euclidean transformations.

d) Frequently, in real applications the assumption of invariance is too strong 
and must be replaced by the assumption of quasi-invariance. Its theory and 
important role for grouping and recognition has been worked out by Binford 
and Levitt [20]. For example, Gros et al. use geometric quasi-invariants of 
pairs of line segments to match and model images [68]. 

^ See also the IEEE Workshop on Qualitative Vision for interesting contributions
[4].
e) Finally, for the reconstruction of 3D shape and/or motion the ill-posed
problem is treated using regularization approaches, which incorporate
smoothness and rigidity assumptions of the object surface [52].
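
The reverse ordering of invariants mentioned in item c) can be checked
numerically for four collinear points. The sketch below uses a
one-dimensional projective map t -> (a t + b) / (c t + d) along the line and
a translation as the euclidean motion; the coefficients and point positions
are arbitrary example values.

```python
def cross_ratio(t1, t2, t3, t4):
    # Cross-ratio of four collinear points given by their line coordinates.
    return ((t3 - t1) * (t4 - t2)) / ((t3 - t2) * (t4 - t1))

def projective_map(t, a=2.0, b=1.0, c=0.1, d=1.0):
    # 1D projective transformation (requires a*d - b*c != 0).
    return (a * t + b) / (c * t + d)

def euclidean_map(t, shift=3.0):
    # 1D euclidean motion: a translation along the line.
    return t + shift

points = [0.0, 1.0, 2.0, 4.0]
projected = [projective_map(t) for t in points]
moved = [euclidean_map(t) for t in points]

print(cross_ratio(*points), cross_ratio(*projected), cross_ratio(*moved))
print(points[1] - points[0], projected[1] - projected[0], moved[1] - moved[0])
```

The cross-ratio agrees in all three cases, whereas the distance between the
first two points is preserved only by the euclidean motion.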

Problems with the Use of Knowledge in Image Analysis 

A critical introspection reveals some problems concerning the applicability 
of all these constraints. For example, Jain and Binford have pointed out 
that smoothness and rigidity constraints of objects must be applied locally 
to image regions of depicted object surfaces, but the major problem is to 
find those meaningful areas [86]. Obviously, in real applications the listed
assumptions are too general, and should be more directly related to the type of
the actual vision task, i.e. the categories of relevant situations and goals. In
knowledge-based systems for image understanding (e.g. the system of Liedtke
et al. [102]), an extensive use of domain-specific knowledge was proposed [43]. 
These systems fit quite well to Marr’s theory of vision in the sense of striving 
for general vision systems by explicitly incorporating object-specific assump- 
tions [105]. However, the extensive use of knowledge contradicts Occam’s 
economy principle, and in many applications the explicit formulation of ob- 
ject models is difficult and perhaps even impossible. 

2.1.2 Object Localization and Boundary Extraction 

Having the purpose of robotic grasping and arranging in mind, we present 
a system for localizing approximate polyhedral objects in the image and 
extracting their qualitative boundary line configurations. The approach is 
successful in real-world robotic scenes which are characterized by clutter,
occlusion, shading, etc. A global-to-local strategy is favoured, i.e. first to look
for a candidate set of objects by taking only the approximate silhouette into 
account, then to recognize target objects of certain shape classes in the candi- 
date set by applying view-based approaches, and finally to extract a detailed 
boundary. 

Extracting Salient Polygons or Arrangements of Polygons 

Our approach for localization is to find salient polygons, which represent sin- 
gle faces or silhouettes of objects. The saliency of polygons is based on geo- 
metric/photometric compatible features and on geometric regularity features. 
The first category of features comprises compatibility evaluations between 
geometric and photometric line features and geometric and photometric junc- 
tion features. The Hough transformation is our basic technique for extracting 
line segments and for organizing them into polygons. Its robustness concern- 
ing parameter estimation is appreciated and the loss of locality is overcome by 
the geometric/photometric compatibility principles. For extracting and char- 
acterizing junctions, a corner detector is used in combination with a rotating 
wedge filter. The second category of features involved in the saliency measure 
of polygons comprises geometric compatibilities under projective
transformation. Specifically, they describe regularity aspects of 3D silhouettes and 2D
polygons, respectively. Examples of regularities are parallelism, right angles,
reflection-symmetry, and translation-symmetry.

The ordered places of most salient polygons are visited for special local 
treatment. First, an appearance-based approach can be applied for specific 
object recognition or recognition of certain shape classes (see Chapter 3).^ 
Second, a generic procedure can be applied for detailed boundary extrac- 
tion of certain shape classes, e.g. parallelepipeds. Our approach is to extract 
arrangements of polygons from the images by incorporating a parallelism 
compatibility, a pencil compatibility, and a vanishing-point compatibility, all 
of which originate from general assumptions of projective transformation of 
regular 3D shapes. A major contribution of this chapter is that the basic 
procedure of line extraction, i.e. Hough transformation, and all subsequent 
procedures are controlled by constraints which are inherent in the three- 
dimensional nature of the scene objects and inherent in the image formation 
principles of the camera system. 

General Principles instead of Specific Knowledge 

Our system is organized in several procedures for which the relevant assump- 
tions are clearly stated. The assumptions are related to the situatedness and 
corporeality of the camera-equipped robot system, i.e. compatibilities of reg- 
ular shapes under projective transformation and geometric/photometric com- 
patibilities of image formation. Furthermore, these assumptions are stratified 
according to decreasing generality, which imposes a certain degree of general- 
ity on the procedures. Concerning the objects in the scene, our most general 
assumption is that the object shape is an approximate polyhedron, and an 
example for a specific assumption is that an approximate parallelepiped is 
located in a certain area. We follow the claims of Occam’s minimalistic philos- 
ophy and elicit the general principles underlying the perspective projection of 
polyhedra, and then implement procedures as generally applicable as possi- 
ble. Based on this characterization of the methodology, relevant contributions 
in the literature will be reviewed. 



2.1.3 Detailed Review of Relevant Literature 

Object detection can be considered as a cyclic two-step procedure of localiza- 
tion and recognition [12], which is usually organized in several levels of data 
abstraction. Localization is the task of looking for image positions where 
objects of a certain class are located. In the recognition step, one of these 
locations is considered to identify the specific object. Related to the problem
of boundary extraction, the task of localization is strongly correlated to
perceptual organization, e.g. to organize those gray value edges which belong to
the boundary of a certain object.

® Alternatively, in certain applications histogram-based indexing approaches are
also useful [164].

Approaches to Perceptual Organization 

Sarkar and Boyer have reviewed the relevant work (up to year 1992) in per- 
ceptual organization [146] and proposed a four-level classification of the ap- 
proaches, i.e. signal level, primitive level, structural level, assembly level. For 
example, at the signal level pixels are organized into edge chains, at the prim- 
itive level the edge chains are approximated as polylines (i.e. sequences of 
line segments), at the structural level the polylines are combined to polygons, 
and at the assembly level several polygons are organized into arrangements. 
For future research they suggested: 



“There is a need for research into frameworks for integration of
various Gestaltic cues including non-geometric ones ...”



Sarkar and Boyer also presented a hierarchical system for the extraction 
of curvilinear or rectilinear structures [147]. Regularities in the distribution 
of edges are detected using “voting” methods for Gestaltic phenomena of
proximity, similarity, smooth continuity and closure. The approach is generic 
in the sense that various forms of tokens can be treated and represented as 
graphs, and various types of structures can be extracted by applying stan- 
dardized graph analysis algorithms. Our approach incorporates several types 
of non-geometric cues (i.e. photometric features), treats closed line
configurations of higher complexity including higher level Gestaltic phenomena, and
from that defines a saliency measure for different candidates of line organi- 
zations. 

In a work of Zisserman et al. grouping is done at all four levels [182]. 
Line structures belonging to an object are extracted by using techniques of 
edge detection, contour following and polygonal approximation (signal level, 
primitive level, structural level). The representation is given by certain in- 
variants to overcome difficulties in recognizing objects under varying view- 
points. These geometric invariants are used to define an indexing function 
for selecting certain models of object shapes, e.g. certain types of polyhe- 
dra or surfaces of revolution. Based on a minimal set of invariant features, 
a certain object model is deduced, and based on that, a class-based group- 
ing procedure is applied for detailed boundary extraction (assembly level). 
For example, under affine imaging conditions the parallelism of 3D lines of a 
polyhedron also holds between the projected lines in the image. Accordingly,
certain lines of the outer border of an object appear with the same
orientation in the interior of the silhouette. This provides evidence for grouping lines
for describing a polyhedron. Our approach contributes to this work, in that 
we introduce some assembly level grouping criteria for boundary extraction 
of approximate polyhedra. These criteria are a parallelism compatibility, a
pencil compatibility, and a vanishing-point compatibility. 

Castano and Hutchinson present a probabilistic approach to perceptual 
grouping at the primitive and structural level [33]. A probability distribu- 
tion over a space of possible image feature groupings is determined, and the 
most likely groupings are selected for further treatment. The probabilities 
are based on how well a set of image features fits to a particular geometric 
structure and on the expected noise in image data. The approach is
demonstrated for two types of low-level geometric structures, i.e. straight lines and
bilateral symmetries. Complex symmetrical structures consisting of groups
of line segments are extracted in a work of Ylä-Jääski and Ade [179]. In a
two-step procedure, pairs of line segments are first detected which are the 
basic symmetry primitives, and then several of them are selectively grouped 
along the symmetry axes of segment pairs. Our approach is more general in 
the sense that we treat further types of regularities and symmetries. 

Amir and Lindenbaum present a grouping methodology for both signal 
and primitive levels [3]. A graph is constructed whose nodes represent
primitive tokens such as edges, and whose arcs represent grouping evidence based
on collinearity or general smoothness criteria. Grouping is done by finding 
the best graph partition using a maximum likelihood approach. A measure 
for the quality of detected edge organizations is defined which could be used 
as decision function for selectively postprocessing certain groups. In our ap- 
proach, graph analysis is avoided at the primitive level of grouping edges 
(because it seems to be time-consuming), but is used for detecting polygons 
at the structural level. 

Cho and Meer propose an approach for detecting image regions by eval- 
uating a compatibility among a set of slightly different segmentations [36]. 
Local homogeneity is based on co-occurrence probabilities derived from the 
ensemble of initial segmentations, i.e. probabilities that two neighboring
pixels belong to the same image region. Region adjacency graphs at several levels
are constructed and exploited for this purpose. In our work, image segmenta- 
tion is based on compatibilities between geometric and photometric features 
and on geometric regularity features which are compatible under projective 
transformation. 

Hough Transformation as Possible Foundation 

Frequently, Hough transformation has been used as basic procedure for group- 
ing at the primitive or structural level [100]. Specific arrangements of gray 
value edges are voting for certain analytic shapes, e.g. straight lines or el- 
lipses. For example, each line of edges creates a peak of votes in the space of 
line parameters, and the task is to localize the peaks. In order to make the 
Hough transformation more sensitive, one can go back to the signal level and 
take the orientations of the gray value edges into account. Since orientation 
is a parameter in a polar representation of a line, the number of possible line 
orientations, for which a pixel may vote, can be reduced to the relevant one.
The size of this voting kernel influences the sharpness of the Hough peaks
[123], i.e. the accuracy of line parameters. The Hough image can be used for
grouping at the structural level or even at the assembly level. The problem is 
to find especially those peaks which arise from lines belonging to a specific ob- 
ject. Princen et al. use a hierarchical procedure which extracts an exhaustive 
set of peaks and afterwards selects the relevant subset by applying Gestaltic 
grouping criteria [135]. Wahl and Biland extract objects from a polyhedral 
scene by representing an object boundary as a distributed pattern of peaks 
in the parameter space of lines [172]. Alternatively, Ballard introduced the 
generalized Hough transformation for the extraction of complex natural 2D 
shapes, in which a shape is represented in tabular form instead of an analytic 
formula [10]. 

This short review of extensions of the standard Hough transformation 
gives the impression that our Hough voting procedure can serve as basis for 
perceptual organization at all perception levels and for integrating cues from
all levels. The greatest weakness of the standard Hough transformation is the 
loss of locality, e.g. a line can gain support from pixels anywhere along its 
length from image border to border. Therefore, two or more line segments 
may be misinterpreted as one line, or short line segments may be overlooked. 
Consequently, Yang et al. introduce a weighted Hough transformation, in 
which the connectivity of a line is measured in order to detect also short line 
segments [178]. Similarly, Foresti et al. extend the Hough transformation to
labeled edges [57]. Each label corresponds to a line segment which is extracted 
by a classical line following procedure taking connectivity and straightness in 
the course of edges into account. Our approach to overcome this problem is to 
apply three principles which are related to the geometric/photometric com- 
patibility. The first one is the line/edge orientation compatibility (mentioned 
above), the second one takes the position and characterization of gray value 
corners into account, and the third principle consists of checking the simi- 
larity of local phase features along the relevant line segment. Furthermore, 
locality of boundary extraction is reached by applying a windowed Hough 
transformation within the areas of most salient polygons. 



2.1.4 Outline of the Sections in the Chapter 

Section 2.2 recalls the definitions of standard Hough transformation and 
orientation-selective Hough transformation for line extraction. Geometric/ 
photometric compatible features are introduced, based on the principle of 
orientation compatibility between lines and edges, and on the principle of 
junction compatibility between pencils and corners. 

Section 2.3 defines regularity features of polygons, i.e. parallelism or 
right-angle between line segments and reflection-symmetry or translation- 
symmetry between polylines. For these features, certain compatibilities exist 
under projective transformation. Furthermore, the principle of phase
compatibility between parallel line segments (short parallel) on the one hand, and
gray value ramps on the other hand is introduced. The regularity features are 
combined with the geometric/photometric compatible features in a generic 
procedure for extracting salient quadrangles or polygons. 

Section 2.4 introduces grouping criteria at the assembly level, i.e. the 
vanishing-point compatibility and the pencil compatibility. These assembly 
level criteria are integrated with the compatibilities which hold at the signal, 
primitive and structural levels or between them. Two generic procedures 
are presented for extracting the arrangements of polygons for approximate 
polyhedra. 

In Section 2.5, task-relevant visual demonstrations are taken into account 
for learning the degrees of the involved compatibilities, e.g. justification of 
the compatibilities for typical objects from scenes of electrical scrap. 

Section 2.6 discusses the approach on the basis of all introduced compat- 
ibilities which are assumed to be inherent in the three-dimensional nature of 
the scene objects, and/or inherent in the image formation principles of the 
camera system. 



2.2 Geometric/Photometric Compatibility Principles 

Obviously, the general assumption behind all approaches of boundary extrac- 
tion is that three-dimensional surface discontinuities must have corresponding 
gray value edges in the image. Nearly all problems can be traced back to a gap 
between the geometric and the photometric type of scene representation. This 
section introduces two examples of reasonable compatibilities between geomet- 
ric and photometric features, i.e. orientation compatibility between lines and 
edges, and junction compatibility between pencils and corners. The geomet- 
ric features are single straight lines and pencils of straight lines, respectively. 
Hough transformation is used as basic procedure for line extraction. The pho- 
tometric features are gray value edges and corners. Gradient magnitudes are 
binarized for the detection of gray value edges and Gabor wavelet operators 
are applied for estimating the orientations of the edges. The SUSAN operator 
is used for the detection and a rotating wedge filter for the characterization 
of gray value corners [159, 158]. 



2.2.1 Hough Transformation for Line Extraction 

For representing straight image lines, we prefer the polar form (see Figure 2.1)
which avoids singularities. Let $\mathcal{P}$ be the set of discrete coordinate tuples
$p := (x_1, x_2)^T$ for the image pixels of a gray value image $\mathcal{I}^G$ with $I_w$ columns
and $I_h$ rows. A threshold parameter $\delta_1$ specifies the permissible deviation
from linearity for a sequence of image pixels.








Fig. 2.1. Cartesian coordinate system with axes $X_1$ and $X_2$; polar form of a line
with distance parameter $r$ and angle parameter $\phi$ taken relative to the image center.



Definition 2.1 (Polar representation of a line) The polar representation
of an image line is defined by

$$f^L(p, q) := x_1 \cdot \cos(\phi) + x_2 \cdot \sin(\phi) - r, \qquad |f^L(p, q)| \leq \delta_1 \qquad (2.1)$$

Parameter $r$ is the distance from the image center to the line along a
direction normal to the line. Parameter $\phi$ is the angle of this normal direction
relative to the $x_1$-axis. The two line parameters $q := (r, \phi)^T$ are assumed to be
discretized. This calls for the inequality symbol in equation (2.1) to describe
the permissible deviation from the ideal value zero. For the parameter space
$Q$ we define a discrete two-dimensional coordinate system (see Figure 2.2).
The horizontal axis is for parameter $r$ whose values reach from $-I_d/2$ to $+I_d/2$,
whereby $I_d$ is the length of the image diagonal. The vertical axis is for
parameter $\phi$ whose values reach from 0° to 180°. The discrete
coordinate system can be regarded as a matrix consisting of $I_d$ columns and
180 rows.

Due to discretization, each parameter tuple is regarded as a bin of
real-valued parameter combinations, i.e. it represents a set of image lines with
similar orientation and position. The standard Hough transformation counts
for each bin how many edges in a gray value image lie along the lines which
are specified by the bin. In the Hough image these numbers of edges are
represented for each parameter tuple. Each peak in the Hough image indicates
that the gray value image $\mathcal{I}^G$ contains an approximate straight line of edges,
whose parameters are specified by the position of the peak. A binary image
$\mathcal{I}^B$ is used which represents the edge points of $\mathcal{I}^G$ by $\mathcal{I}^B(p) = 1$ and all the
other points by $\mathcal{I}^B(p) = 0$.

Fig. 2.2. Coordinate system for the two-dimensional space of line parameters
(Hough image); horizontal axis $R$ for parameter $r$ reaching from $-I_d/2$ to $+I_d/2$, and
vertical axis $\Phi$ for parameter $\phi$ reaching from 0° to 180°.



Definition 2.2 (Standard Hough transformation, SHT) The standard
Hough transformation (SHT) of the binary image $\mathcal{I}^B$ relative to the polar
form $f^L$ of a straight line is of functionality $Q \rightarrow [0, \dots, (I_w \cdot I_h)]$. The
resulting Hough image is defined by

$$\mathcal{I}^{SH}(q) := \#\{p \in \mathcal{P} \mid \mathcal{I}^B(p) = 1 \wedge |f^L(p, q)| \leq \delta_1\} \qquad (2.2)$$

with symbol $\#$ denoting the number of elements of a set. Figure 2.3 shows
on top the gray value image of a dark box (used as dummy box in
electrical equipment) and at bottom the binarized image of gray value edges.
The Hough image $\mathcal{I}^{SH}$ of the standard Hough transformation is depicted
in Figure 2.4. Typically, wide-spread maxima occur because
all edges near or on a line cause the SHT to increase not only the level of
the relevant bin but also the levels of many bins in its neighborhood. We are
interested in sharp peaks in order to locate them easily in the Hough image
(i.e. extract the relevant lines from the gray value image) and estimate the
line parameters accurately. By making the discretization of the parameter space
more fine-grained, the maxima would become sharper and more accurate. However,
the computational expenditure for the Hough transformation would increase
significantly.
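The counting rule of Definition 2.2 can be illustrated with a minimal NumPy sketch. It is not the implementation used for the figures: the function name `standard_hough`, the rounding of $r$ to unit-width bins (which corresponds to $\delta_1 = 0.5$), and the synthetic test image are assumptions made for illustration only.

```python
import numpy as np

def standard_hough(edge_img, n_angles=180):
    """Sketch of the SHT: every edge pixel votes for all discrete angles.

    Assumptions (not from the text): r is rounded to unit-width bins,
    i.e. delta_1 = 0.5, and coordinates are taken relative to the image center.
    """
    h, w = edge_img.shape
    r_offset = int(np.ceil(np.hypot(h, w) / 2.0))   # half image diagonal
    hough = np.zeros((n_angles, 2 * r_offset + 1), dtype=np.int64)
    rows, cols = np.nonzero(edge_img)
    x1 = cols - (w - 1) / 2.0                       # horizontal coordinate
    x2 = rows - (h - 1) / 2.0                       # vertical coordinate
    for k in range(n_angles):
        phi = np.deg2rad(k * 180.0 / n_angles)
        r = x1 * np.cos(phi) + x2 * np.sin(phi)     # polar form with f = 0
        idx = np.rint(r).astype(int) + r_offset
        np.add.at(hough[k], idx, 1)                 # accumulate votes
    return hough, r_offset

# Synthetic example: a vertical line of edge pixels, 5 pixels right of center.
edges = np.zeros((21, 21), dtype=np.uint8)
edges[:, 15] = 1
hough, r_offset = standard_hough(edges)
k, j = np.unravel_index(np.argmax(hough), hough.shape)
print("phi =", k, "r =", j - r_offset, "votes =", hough[k, j])
```

Every edge pixel contributes one vote per angle row, which is the reason for the wide-spread maxima discussed above.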



2.2.2 Orientation Compatibility between Lines and Edges 

The conflict between accuracy of extracted lines and efficiency of line extrac- 
tion can be reduced by making use of an orientation compatibility between 
lines and gray value edges.

Fig. 2.3. (Top) Gray value image of an electrical dummy box; (Bottom) Binarized
gradient magnitudes indicating the positions of gray value edges.

Fig. 2.4. Standard Hough transformation of the binarized image in Figure 2.3,
i.e. accumulation array for discrete line parameters (distance $r$ and angle $\phi$).
Wide-spread maxima in the Hough image (except for three sharp peaks).



Assumption 2.1 (Line/edge orientation compatibility for a line
point) The orientation $\phi$ of a line of gray value edge points and the
orientation $\mathcal{I}^O(p)$ of an edge at point $p := (x_1, x_2)^T$ on the line are approximately
equal. The replacement of $\phi$ by $\mathcal{I}^O(p)$ in the polar form of the image line
implies just a small deviation from the ideal value zero. The necessary
geometric/photometric compatibility is specified by parameter $\delta_2$.

$$|x_1 \cdot \cos(\mathcal{I}^O(x_1, x_2)) + x_2 \cdot \sin(\mathcal{I}^O(x_1, x_2)) - r| \leq \delta_2 \qquad (2.3)$$

A line of edge points may originate from the gray value contrast at the
object surface, e.g. due to texture, inscription, shape discontinuities or
figure/background separation. Small distortions in the imaging process and
inaccuracies in determining the edge orientation are considered in
equation (2.3) by parameter $\delta_2$, which specifies the upper bound for permissible
errors.

Estimation of the Orientation of Gray Value Edges 

In our system, the orientations of gray value edges are extracted by applying 
to the image a set of four differently oriented 2D Gabor functions and com- 
bining the responses appropriately [66, pp. 219-258]. The Gabor function is 
a Gauss-modulated, complex-harmonic function, which looks as follows (2D 
case). 

$$f_{\mathcal{D},U}(p) := \exp(-\pi \cdot p^T \cdot (\mathcal{D})^{-2} \cdot p) \cdot \exp(-i \cdot 2 \cdot \pi \cdot U^T \cdot p) \qquad (2.4)$$

The diagonal matrix $\mathcal{D} := \mathrm{diag}(\sigma_1, \sigma_2)$ contains the eccentricity values of the
Gaussian in two orthogonal directions, the vector $U := (u_1, u_2)^T$ consists of
the center frequencies, and $i$ is the imaginary unit.

The general case of a rotated Gabor function is obtained by 

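$$f_{\mathcal{D},U,\psi}(p) := f_{\mathcal{D},U}(R_\psi \cdot p), \qquad R_\psi := \begin{pmatrix} \cos(\psi) & \sin(\psi) \\ -\sin(\psi) & \cos(\psi) \end{pmatrix} \qquad (2.5)$$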

Four rotated versions of Gabor functions are defined for $\psi := 0°, 45°, 90°,$
and $135°$, which means that the individual filters respond most sensitively to
edges whose orientations (i.e. gradient angles) are equal to the rotation angle
of the filter. This specific choice of filter orientations yields considerable
simplifications in successive computations.

The edge orientation is estimated from the amplitudes $\mathcal{I}^A_1(p)$, $\mathcal{I}^A_2(p)$,
$\mathcal{I}^A_3(p)$, $\mathcal{I}^A_4(p)$ of the complex response of the four filters. These amplitudes
are multiplied with the cosine of the doubled angle and added up, and this
procedure is repeated with the sine of the doubled angles. From the two
results, we compute the arc tangent, taking the quadrant of the coordinate
system into account, transform the result into an angle which must be
discretized within the integer set $\{0, \dots, 179\}$, and do simple exception handling
at singularities.

¹ A much simpler approach would combine the responses of just two orthogonally
directed Sobel filters. However, our Gabor-based approach can be parameterized
flexibly, and thus the orientation of gray value edges can be estimated more
accurately.
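$$\mathcal{I}^H_1(p) := \mathcal{I}^A_1(p) \cdot \cos(0°) + \mathcal{I}^A_2(p) \cdot \cos(90°) + \mathcal{I}^A_3(p) \cdot \cos(180°) + \mathcal{I}^A_4(p) \cdot \cos(270°) = \mathcal{I}^A_1(p) - \mathcal{I}^A_3(p) \qquad (2.6)$$

$$\mathcal{I}^H_2(p) := \mathcal{I}^A_1(p) \cdot \sin(0°) + \mathcal{I}^A_2(p) \cdot \sin(90°) + \mathcal{I}^A_3(p) \cdot \sin(180°) + \mathcal{I}^A_4(p) \cdot \sin(270°) = \mathcal{I}^A_2(p) - \mathcal{I}^A_4(p) \qquad (2.7)$$

$$\mathcal{I}^H_3(p) := -0.5 \cdot \arctan(\mathcal{I}^A_2(p) - \mathcal{I}^A_4(p),\; \mathcal{I}^A_1(p) - \mathcal{I}^A_3(p)) \qquad (2.8)$$

$$\mathcal{I}^H_4(p) := \begin{cases} \pi + \mathcal{I}^H_3(p) & : \; \mathcal{I}^H_3(p) < 0 \\ \mathcal{I}^H_3(p) & : \; \mathcal{I}^H_3(p) \geq 0 \end{cases} \qquad (2.9)$$

$$\mathcal{I}^O(p) := \mathrm{round}((\mathcal{I}^H_4(p) / \pi) \cdot 180°) \qquad (2.10)$$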
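The combination of the four filter amplitudes can be sketched in a few lines of NumPy. This is an illustration under assumptions: the argument names `a1` to `a4` (amplitudes of the filters at 0°, 45°, 90°, 135°) and the function name are hypothetical, the factor of $-0.5$ is taken over from equation (2.8) as printed, and mapping a rounded value of 180 back to 0 is one possible reading of the "simple exception handling".

```python
import numpy as np

def edge_orientation(a1, a2, a3, a4):
    """Sketch: discrete edge orientation in {0, ..., 179} from four amplitudes.

    a1..a4 are assumed to be the response amplitudes of the Gabor filters
    rotated by 0, 45, 90 and 135 degrees (scalars or equally shaped arrays).
    """
    h1 = a1 - a3                       # cosine-weighted sum, doubled angles
    h2 = a2 - a4                       # sine-weighted sum, doubled angles
    h3 = -0.5 * np.arctan2(h2, h1)     # quadrant-aware arc tangent, halved
    h4 = np.where(h3 < 0, np.pi + h3, h3)
    # Round to integer degrees; 180 is identified with 0 (assumed handling).
    return np.rint(np.degrees(h4)).astype(int) % 180

# A pure response of the 0-degree filter and of the 90-degree filter.
print(int(edge_orientation(1.0, 0.0, 0.0, 0.0)))
print(int(edge_orientation(0.0, 0.0, 1.0, 0.0)))
```

The function also accepts whole amplitude images, so an orientation image can be computed in a single call.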



The standard Hough transformation is modified by taking the orientation
at each edge point into account and accumulating only that small set
of parameter tuples for which equation (2.3) holds. A tolerance band $\delta_2$ is
introduced to take the inaccuracy of the edge orientation $\mathcal{I}^O(p)$ at position
$p$ into account. The bound in equation (2.3) correlates with the tolerance
band $\delta_2$ in equation (2.11); therefore, in the following we refer only to $\delta_2$.

Definition 2.3 (Orientation-selective Hough transformation, OHT)
The orientation-selective Hough transformation (OHT) of the binary image
$\mathcal{I}^B$ and the orientation image $\mathcal{I}^O$ relative to the polar form $f^L$ of a straight
line is of functionality $Q \rightarrow [0, \dots, (I_w \cdot I_h)]$. The resulting Hough image $\mathcal{I}^{OH}$
is defined by

$$\mathcal{I}^{OH}(q) := \#\{p \in \mathcal{P} \mid \mathcal{I}^B(p) = 1 \wedge |f^L(p, q)| \leq \delta_1 \wedge (\phi - \delta_2) \leq \mathcal{I}^O(p) \leq (\phi + \delta_2)\} \qquad (2.11)$$

Figure 2.5 shows the resulting Hough image $\mathcal{I}^{OH}$ of the OHT if we assign
$\delta_2 = 2°$ for the tolerance band of edge orientations. Compared
to the Hough image $\mathcal{I}^{SH}$ of the SHT in Figure 2.4, we see that more local
maxima are sharpened in the Hough image $\mathcal{I}^{OH}$ of the OHT.
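A minimal NumPy sketch of the orientation-selective voting may help to see why the peaks become sharper: each edge pixel votes only for the few angle rows within the tolerance band around its own orientation. The function name, the unit-width rounding of $r$ (corresponding to $\delta_1 = 0.5$), the integer-degree orientation image, and the wrap-around of angles modulo 180° are assumptions of this sketch and not taken from the text.

```python
import numpy as np

def orientation_selective_hough(edge_img, orient_img, delta2=2):
    """Sketch of the OHT: votes are restricted to |phi - orientation| <= delta2.

    orient_img is assumed to hold integer edge orientations in degrees.
    """
    h, w = edge_img.shape
    r_offset = int(np.ceil(np.hypot(h, w) / 2.0))
    hough = np.zeros((180, 2 * r_offset + 1), dtype=np.int64)
    rows, cols = np.nonzero(edge_img)
    x1 = cols - (w - 1) / 2.0
    x2 = rows - (h - 1) / 2.0
    theta = orient_img[rows, cols].astype(int)
    for d in range(-delta2, delta2 + 1):
        k = (theta + d) % 180                    # assumed wrap-around of angles
        phi = np.deg2rad(k)
        r = x1 * np.cos(phi) + x2 * np.sin(phi)
        idx = np.rint(r).astype(int) + r_offset
        np.add.at(hough, (k, idx), 1)
    return hough, r_offset

# Vertical line of edge pixels whose edge orientation is 0 degrees everywhere.
edges = np.zeros((21, 21), dtype=np.uint8)
edges[:, 15] = 1
orient = np.zeros((21, 21), dtype=int)
hough, r_offset = orientation_selective_hough(edges, orient, delta2=2)
print(int(hough[0, r_offset + 5]), int(hough.sum()))
```

In this example each pixel casts 5 votes instead of 180, which reduces both the computational effort and the spread of the maxima.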

The local maxima can be obtained iteratively by looking for the global
maximum, erasing the peak position together with a small surrounding area,
and restarting the search for the next maximum. Due to the sharpness of the
peaks in $\mathcal{I}^{OH}$ it is much easier (compared to $\mathcal{I}^{SH}$) to control the area size
to be erased in each iteration. Figure 2.6 shows the extracted lines specified
by the set of the 10 highest peaks in the Hough images of SHT and OHT
(top and bottom, respectively). Obviously, the lines extracted with OHT,
which consider the line/edge orientation compatibility, are more relevant and
accurate for describing the object boundary. The line/edge orientation
compatibility not only supports the extraction of relevant lines, but is also useful
for verifying or adjusting a subset of candidate lines, which are determined
in the context of other approaches (see Section 2.3 and Section 2.4 later on).
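The iterative search for local maxima described above can be sketched as follows. The function name, the rectangular shape of the erased area and its default size are assumptions for illustration; in particular, the sketch does not treat the wrap-around between the angle rows 179 and 0.

```python
import numpy as np

def extract_peaks(hough, n_peaks=10, half_phi=3, half_r=3):
    """Sketch: repeatedly take the global maximum and erase its surroundings.

    Returns a list of (angle row, r column, votes). The erased area is a
    rectangle of (2*half_phi+1) x (2*half_r+1) bins (assumed shape and size).
    """
    acc = hough.copy()
    peaks = []
    for _ in range(n_peaks):
        k, j = np.unravel_index(np.argmax(acc), acc.shape)
        if acc[k, j] <= 0:                       # nothing left to extract
            break
        peaks.append((int(k), int(j), int(acc[k, j])))
        acc[max(k - half_phi, 0):k + half_phi + 1,
            max(j - half_r, 0):j + half_r + 1] = 0
    return peaks

# Toy accumulator with two separated maxima and one neighbor of the first.
toy = np.zeros((180, 31), dtype=np.int64)
toy[10, 5], toy[10, 6], toy[90, 20] = 7, 6, 5
print(extract_peaks(toy))
```

The neighboring bin with 6 votes is erased together with the first peak, so it is not reported as a separate line.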
Fig. 2.5. Orientation-selective Hough transformation of the binarized image in
Figure 2.3 by taking an image of edge orientations into account. Several local
maxima are much sharper than in the Hough image of SHT in Figure 2.4.




Fig. 2.6. Extracted image lines based on the 10 highest peaks in the Hough
image of SHT (top) and of OHT (bottom). The lines extracted with OHT are more
relevant and accurate for describing the object boundary.

Experiments on the Line/Edge Orientation Compatibility

The orientation compatibility is used to verify certain segments of lines, i.e.
restrict the unbounded lines extracted with OHT to the relevant segments of
an object boundary. The finite set of discrete points $p_i$, $i \in \{1, \dots, N\}$, of a
line bounded between $p_1$ and $p_N$ is denoted by $\mathcal{L}(p_1, p_N)$. In Figure 2.7 the
line segment $\mathcal{L}(p_a, p_d)$ through the characteristic points $\{p_a, p_b, p_c, p_d\}$ is only
relevant between $p_b$ and $p_c$. Figure 2.8 shows the course of edge orientation for
the points $p_i$ located on the line segment $\mathcal{L}(p_a, p_d)$. The horizontal axis is for
the points on the line segment and the vertical axis for the orientations.
Furthermore, the orientation $\phi$ of the line segment $\mathcal{L}(p_a, p_d)$ is depicted, which
is of course independent of the points on the line. For the points of the line
segment $\mathcal{L}(p_b, p_c)$ we obtain small deviation values between the edge
orientations and the line orientation. On the other hand, there is large variance
in the set of deviation values coming from edge orientations of the points of
the line segments $\mathcal{L}(p_a, p_b)$ or $\mathcal{L}(p_c, p_d)$.




Fig. 2.7. Example line with characteristic points $\{p_a, p_b, p_c, p_d\}$, defined by
intersection with other lines and with the image border. Just the line segment between
$p_b$ and $p_c$ is relevant for the boundary.



For verifying a line segment, we evaluate the deviation between the ori- 
entation of the line and the orientations of the gray value edges of all points 
on the line segment. 

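Definition 2.4 (Orientation-deviation related to a line segment) The
orientation-deviation between orientation $\phi$ of a line and the orientations of
all edges on a segment $\mathcal{L}(p_1, p_N)$ of the line is defined by

$$D_{LE}(\phi, \mathcal{L}(p_1, p_N)) := \frac{1}{N} \cdot \sum_{i=1}^{N} D_O(\phi, \mathcal{I}^O(p_i)) \qquad (2.12)$$

$$D_O(\phi, \mathcal{I}^O(p_i)) := \frac{\min\{|d_i|, |d_i + 180°|, |d_i - 180°|\}}{90°} \qquad (2.13)$$

$$d_i := \phi - \mathcal{I}^O(p_i) \qquad (2.14)$$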

Fig. 2.8. Course of edge orientation of points along the line $\mathcal{L}(p_a, p_d)$, and
orientation $\phi$ of this line. Small deviation within the relevant segment $\mathcal{L}(p_b, p_c)$, and
large deviations within the other two segments.

The minimization involved in equation (2.13) is due to the restriction of
edge and line orientation to the angle interval [0°, 180°]. For
example, the deviation between a line orientation $\phi = 0°$ and an edge
orientation $\mathcal{I}^O(p) = 180°$ must be defined to be zero. Furthermore, a normalization
factor is introduced to restrict the deviation values to the real unit interval.
The orientation-deviation related to the line segments $\mathcal{L}(p_a, p_b)$, $\mathcal{L}(p_b, p_c)$,
and $\mathcal{L}(p_c, p_d)$ in Figure 2.7 is shown in Figure 2.9. For line segment $\mathcal{L}(p_b, p_c)$
it is minimal, as expected, because this line segment originates from the
actual boundary of the target object.

Based on the definition for orientation-deviation, we can formally intro- 
duce the line/edge orientation compatibility between the orientation of a line 
and the orientations of all edges on a segment of the line. 

Assumption 2.2 (Line/edge orientation compatibility for a line
segment, LEOC) Let $\delta_3$ be the permissible orientation-deviation in the sense
of a necessary geometric/photometric compatibility. The line/edge
orientation compatibility (LEOC) holds between the orientation $\phi$ of a line and the
orientations $\mathcal{I}^O(p_i)$ of all edges on a segment $\mathcal{L}(p_1, p_N)$ of the line if

$$D_{LE}(\phi, \mathcal{L}(p_1, p_N)) \leq \delta_3 \qquad (2.15)$$

Fig. 2.9. Mean variance of the edge orientations for three line segments $\mathcal{L}(p_a, p_b)$,
$\mathcal{L}(p_b, p_c)$, $\mathcal{L}(p_c, p_d)$ related to line orientation $\phi$.

For example, Figure 2.9 shows that the line/edge orientation compatibility
just holds for the line segment $\mathcal{L}(p_b, p_c)$ if we apply a compatibility threshold
$\delta_3 = 0.15$. For the extraction of object boundaries, it is desirable to have
threshold $\delta_3$ specified such that the line/edge orientation compatibility just
holds for the line segments which are relevant for a boundary.
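The orientation-deviation and the threshold test can be sketched as follows, assuming that the deviation of a segment is the mean of the normalized per-point deviations with normalization by 90°. The function names and the sample orientation values are hypothetical; only the threshold value 0.15 is taken from the example in the text.

```python
import numpy as np

def orientation_deviation(phi, edge_orients):
    """Sketch: mean normalized deviation between a line orientation phi and
    the edge orientations along a segment (all angles in degrees)."""
    d = phi - np.asarray(edge_orients, dtype=float)
    # Orientations are only defined modulo 180 degrees.
    dev = np.minimum(np.abs(d), np.minimum(np.abs(d + 180.0), np.abs(d - 180.0)))
    return float((dev / 90.0).mean())

def leoc_holds(phi, edge_orients, delta3=0.15):
    """Line/edge orientation compatibility test with threshold delta3."""
    return orientation_deviation(phi, edge_orients) <= delta3

# Hypothetical samples: a well-aligned segment and a poorly aligned one.
print(orientation_deviation(0, [180, 0]))
print(leoc_holds(10, [13, 7]), leoc_holds(10, [100, 95, 60]))
```

The first call illustrates the wrap-around: an edge orientation of 180° counts as identical to a line orientation of 0°.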

Appropriate values for thresholds $\delta_1$ and $\delta_2$ in Definition 2.3, and $\delta_3$ in
Assumption 2.2 must be determined on the basis of visual demonstration
(see Section 2.5). Furthermore, the parameters $\mathcal{D}$ (Gaussian eccentricity
values) and $U$ (center frequencies) of the Gabor function in equation (2.4) are
determined by task-relevant experimentation.

2.2.3 Junction Compatibility between Pencils and Corners 

The geometric line feature and the photometric edge feature are
one-dimensional in nature. A more sophisticated compatibility criterion can be defined
on the basis of two-dimensional image structures. In the projected object
boundary, usually, two or more lines meet at a common point (see points
$p_b$ and $p_c$ in Figure 2.7). A collection of non-parallel image line segments
meeting at a common point is designated as pencil of lines, and the common
point is designated as pencil point [52, pp. 17-18].

At the pencil point a gray value corner should be detected in the image.
Generally, gray value corners are located at the curvature extrema along edge
sequences. A review of several corner detectors was presented by Rohr [140];
however, we used the recently published SUSAN operator [159]. As an example,
Figure 2.10 shows a set of gray value corners with the corners at the points
$p_b$ and $p_c$ included.




Fig. 2.10. The SUSAN operator has detected a set of gray value corners shown as 
black squares. We find all those gray value corners included which are characteristic 
for the three-dimensional object boundary. 



According to this, we must consider a compatibility between the geometric
pencil feature and the photometric corner feature. Common attributes
are needed for characterizing a junction of lines and a junction of edge
sequences. The term junction is used as generic term both for the geometric
pencil feature and the photometric corner feature. We define an M-junction
to consist of M meeting lines, or of M meeting edge sequences,
respectively. An M-junction of lines can be characterized by the position $p_{pc}$
of the pencil point and the orientations $A := (\alpha_1, \dots, \alpha_M)$ of the meeting
lines related to the horizontal axis. Similarly, an M-junction of edge sequences
is characterized by the position $p_{cr}$ of the corner point and the orientations
$B := (\beta_1, \dots, \beta_M)$ of the meeting edge sequences against the horizontal axis.

Definition 2.5 (Junction-deviation related to a pencil) The
junction-deviation between an M-junction of lines with orientations $A$ at the pencil
point $p_{pc}$ and an M-junction of edge sequences with orientations $B$ at the
corner point $p_{cr}$ is defined by

$$D_{PC}(p_{pc}, p_{cr}, A, B) := \omega_1 \cdot D_{JP}(p_{pc}, p_{cr}) + \omega_2 \cdot D_{JO}(A, B) \qquad (2.16)$$

$$D_{JP}(p_{pc}, p_{cr}) := \frac{\|p_{pc} - p_{cr}\|}{I_d} \qquad (2.17)$$

$$D_{JO}(A, B) := \frac{1}{180° \cdot M} \cdot \sum_{i=1}^{M} \min\{|d_i|, |d_i + 360°|, |d_i - 360°|\} \qquad (2.18)$$

$$d_i := \alpha_i - \beta_i \qquad (2.19)$$

Equation (2.16) combines two components of junction-deviation with the
factors $\omega_1$ and $\omega_2$, which are used to weight each part. The first component (i.e.
equation (2.17)) evaluates the Euclidean distance between pencil and corner
point. The second component (i.e. equation (2.18)) computes the deviation
between the orientation of a line and of the corresponding edge sequence, and
this is done for all corresponding pairs in order to compute a mean value.
Both components and the final outcome of equation (2.16) are normalized
in the real unit interval. Based on this definition, we formally introduce a
pencil/corner junction compatibility.
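The weighted combination of position and orientation deviation can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: the weights are chosen equal and summing to one (so that the result stays in the unit interval), the orientation component is normalized by 180° and averaged over the M pairs, and lines and edge sequences are taken to correspond by index. All names and sample values are hypothetical.

```python
import numpy as np

def junction_deviation(p_pc, p_cr, A, B, diag, w1=0.5, w2=0.5):
    """Sketch of a junction-deviation between a pencil and a corner.

    p_pc, p_cr : pencil point and corner point (pixel coordinates)
    A, B       : orientations of lines and edge sequences in degrees [0, 360)
    diag       : length of the image diagonal, used for normalization
    """
    d_pos = np.linalg.norm(np.subtract(p_pc, p_cr)) / diag
    d = np.asarray(A, dtype=float) - np.asarray(B, dtype=float)
    # Directions are only defined modulo 360 degrees.
    dev = np.minimum(np.abs(d), np.minimum(np.abs(d + 360.0), np.abs(d - 360.0)))
    d_ori = dev.mean() / 180.0
    return float(w1 * d_pos + w2 * d_ori)

# Hypothetical 2-junction: points 5 pixels apart, orientations 20 degrees off.
value = junction_deviation((0, 0), (3, 4), [10, 350], [350, 10], diag=100.0)
print(round(value, 4))
```

In practice the pairing of lines and edge sequences has to be established first, for instance by sorting both orientation lists; the sketch leaves this step out.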

Assumption 2.3 (Pencil/corner junction compatibility, PCJC) Let
$\delta_4$ be the permissible junction-deviation in the sense of a necessary
geometric/photometric compatibility. The pencil/corner junction compatibility
(PCJC) holds between an M-junction of lines and an M-junction of edge
sequences if

$$D_{PC}(p_{pc}, p_{cr}, A, B) \leq \delta_4 \qquad (2.20)$$

This pencil/corner junction compatibility is used as a criterion to evaluate
whether a junction belongs to the boundary of a target object. For
illustration, the criterion is applied to three junctions designated in Figure 2.11
by the indices 1, 2, 3. Note that junctions 1 and 2 belong to the
boundary of the target object but junction 3 does not. The white squares
show the pencil points $p_{pc1}, p_{pc2}, p_{pc3}$. They are determined by intersection
of straight lines extracted via OHT (see also Figure 2.6 (bottom)). The black
squares show the corner points $p_{cr1}, p_{cr2}, p_{cr3}$. They have been selected from
the whole set of corner points in Figure 2.10 based on nearest neighborhood
to the pencil points. Related to equation (2.17), we realize that all three
pencil points have corresponding corner points in close neighborhood. For the
three junctions 1, 2, 3 we obtain position deviations (normalized by the
image diagonal $I_d$) of about 0.009, 0.006, 0.008, respectively. According to this,
a small position deviation is only a necessary but not a sufficient criterion,
as it does not distinguish junction 3 from junctions 1 and 2. Therefore, the
junctions must be characterized in more detail, which is considered in
equation (2.18).




Fig. 2.11. A subset of three gray value corners is selected. They are located in
close neighborhood to three pencil points of relevant boundary lines, respectively.
Just the corners at junctions 1 and 2 are relevant for boundary extraction but not
the corner at junction 3.



Characterization of Gray Value Corners 

A steerable wedge filter, adopted from Simoncelli and Farid [158], has been
applied at the pencil points $p_{pc1}, p_{pc2}, p_{pc3}$ in order to locally characterize
the gray value structure. We prefer pencil points instead of the neighboring
corner points, because the pencil points arise from line detection, which is
more robust than corner detection. A wedge is rotated in discrete steps
around a pencil point and at each step the mean gray value within the wedge
mask is computed. For example, the wedge started in horizontal direction
pointing to the right and was then rotated counter-clockwise in increments of
4°. According to the steerability property of this filter, we
compute the filter response at the basic orientations, and approximate filter
responses in between two basic orientations (if necessary). This gives a
one-dimensional course of smoothed gray values around the pencil point. The
first derivative of a one-dimensional Gaussian is applied to this course and
the magnitude is computed from it. We obtain for each discrete orientation
a significance measurement for the existence of an edge sequence having just
this orientation and starting at the pencil point. As a result, a curve of filter
responses is obtained which characterizes the gray value structure around the
pencil point.
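The processing chain of wedge means, Gaussian derivative and magnitude can be approximated by a brute-force sketch. It deliberately replaces the steerable filter of Simoncelli and Farid by direct averaging over a rotated wedge mask, so it only illustrates the idea and not the steerable formulation. The function name, the wedge radius and opening angle, and the width of the Gaussian derivative are hypothetical parameters.

```python
import numpy as np

def wedge_response(img, center, radius=10, step_deg=4, wedge_deg=20,
                   sigma_steps=2.0):
    """Sketch: mean gray value in a rotating wedge, followed by a circular
    derivative-of-Gaussian filter; returns (angles in degrees, magnitudes)."""
    cy, cx = center                                   # row, column
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    dy, dx = ys - cy, xs - cx
    dist = np.hypot(dx, dy)
    # Counter-clockwise angle; image rows grow downward, hence -dy.
    ang = np.degrees(np.arctan2(-dy, dx)) % 360.0
    in_disc = (dist > 0) & (dist <= radius)
    angles = np.arange(0, 360, step_deg)
    means = np.empty(len(angles))
    for n, a in enumerate(angles):
        diff = np.abs((ang - a + 180.0) % 360.0 - 180.0)   # circular distance
        means[n] = img[in_disc & (diff <= wedge_deg / 2.0)].mean()
    half = int(3 * sigma_steps)
    t = np.arange(-half, half + 1)
    dog = -t * np.exp(-t**2 / (2.0 * sigma_steps**2))  # unnormalized derivative
    padded = np.concatenate([means[-half:], means, means[:half]])  # circular
    return angles, np.abs(np.convolve(padded, dog, mode="valid"))

# Synthetic corner: bright upper-right quadrant, edges at 0 and 90 degrees.
img = np.zeros((41, 41))
ys, xs = np.mgrid[0:41, 0:41]
img[(xs > 20) & (ys < 20)] = 1.0
angles, resp = wedge_response(img, (20, 20))
print(angles[np.argsort(resp)[-2:]])
```

For the synthetic corner the response is large near 0° and 90° and vanishes inside the homogeneous regions, which corresponds to a 2-junction.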

Experiments on the Pencil/Corner Junction Compatibility

These curves are shown for the junctions 1, 2, and 3 in Figure 2.12,
Figure 2.13, and Figure 2.14, respectively. The curve in Figure 2.12 shows two
local maxima (near to 0° and 360°, respectively, and near to 200°) indicating
a 2-junction. The curve in Figure 2.13 shows three local maxima (near to
20°, 290°, and 360°) indicating a 3-junction. The curve in Figure 2.14 shows
two local maxima (near to 20° and 150°) indicating a 2-junction. The
vertical dotted lines in each figure indicate the orientation of the image lines
(extracted by Hough transformation), which converge at the three junctions,
respectively. We clearly observe that in Figure 2.12 and Figure 2.13 the local
maxima are located near to the orientations of the converging lines. However,
in Figure 2.14 the positions of the curve maxima and of the dotted lines differ
significantly as junction 3 does not belong to the object boundary. By
applying the formula in equation (2.18), we compute about 0.04, 0.03, 0.83 for
the three junctions 1, 2, 3, respectively. Based on a threshold, we can easily
conclude that the pencil/corner junction compatibility holds for junctions 1
and 2, but not for junction 3. Appropriate values for $\delta_4$ in Assumption 2.3
and the parameters of the SUSAN corner detector are determined on the
basis of visual demonstration.

The line/edge orientation compatibility and the pencil/corner junction 
compatibility can be assumed generally for all scenes containing approximate 
polyhedral objects. In Sections 2.3 and 2.4, the evaluation of the junction 
compatibility according to equation (2.16) is combined with the evaluation 
of the orientation compatibility according to equation (2.12). This measure 
will be used in combination with regularity features of object shapes to define 
the relevance of certain line segments for the boundary description.

Fig. 2.12. Response of a counter-clockwise rotating wedge filter applied at junction
1. The two local maxima indicate a 2-junction, i.e. two converging edge sequences. 
The orientations of the converging edge sequences (maxima of the curve) are similar 
to the orientations of two converging lines (denoted by the positions of the vertical 
lines). The pencil/corner junction compatibility holds for junction 1. 




Fig. 2.13. Response of a counter-clockwise rotating wedge filter applied at junc- 
tion 2. The three local maxima indicate a 3-junction, i.e. three converging edge 
sequences. The maxima of the curve are located near the positions of the vertical 
lines. The pencil/corner junction compatibility holds for junction 2.

Fig. 2.14. Response of a counter-clockwise rotating wedge filter applied at junction
3. The two local maxima indicate a 2-junction. The orientations of the converging 
edge sequences are dissimilar to those of the converging lines. The pencil/corner 
junction compatibility does not hold for junction 3. 



2.3 Compatibility-Based Structural Level Grouping 

The orientation-selective Hough transformation (OHT), the line/edge orien- 
tation compatibility (LEOC) and the pencil/corner junction compatibility 
(PCJC) are the basis for detecting high-level geometric structures in the im- 
age. Additionally, we introduce another principle of geometric/photometric 
compatibility, i.e. the phase compatibility between approximate parallel lines
and gray value ramps (PRPC). 

These compatibilities between geometric and photometric image features 
are combined with pure geometric compatibilities under projective trans- 
formation. The geometric compatibilities are determined for geometric reg- 
ularity features which are inherent in man-made 3D objects. Approximate 
parallel or right-angled line segments, or approximate reflection-symmetric 
or translation-symmetric polylines are considered in a sophisticated search 
strategy for detecting organizations of line segments. This section focuses 
on the extraction of polygons originating from the faces or the silhouettes of 
approximate polyhedral 3D objects. Related to the aspect of extracting poly- 
gons, our approach is similar to a work of Havaldar et al. [77], who extract 
closed figures, e.g. by discovering approximate reflection-symmetric polylines 
(they call it skewed symmetries between super segments). The principal dis- 
tinction to our work is that we are striving for an integration of grouping 
cues from the structural level with cues from other levels, i.e. from signal 
level and assembly level.

2.3.1 Hough Peaks for Approximate Parallel Lines

It is well-known for the projective transformation of an ideal pinhole camera
that an image point $(x_1, x_2)^T$ is computed from a 3D scene point $(y_1, y_2, y_3)^T$
by

$$x_1 := b \cdot \frac{y_1}{y_3}, \qquad x_2 := b \cdot \frac{y_2}{y_3} \qquad (2.21)$$

Parameter $b$ is the distance between the lens center and the projection plane.
According to equation (2.21), it is obvious that parallel 3D lines are no longer
parallel after projective transformation to the image (except for lines parallel
to the projection plane).

Fortunately, for certain imaging conditions parallelism is almost invariant
under projective transformation. To give an impression of this, we describe
the imaging conditions under which the picture in Figure 2.15 was taken.
The distance between the camera and the target object was about 1000 mm,
and the lens had a focal length of 12 mm. In this case the deviation from
parallelism, which depends on object orientation, is at most 6°. In the
following, we formulate the parallelism compatibility and relate it to the
configuration of Hough peaks.
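The effect of equation (2.21) on parallel lines can be tried out with a small sketch. It is only an illustration under stated assumptions: the segment coordinates (in mm) are hypothetical, the function names are ours, and the orientation distance is assumed to be the smallest angular difference modulo 180°.

```python
import math

def project(point, b=12.0):
    """Equation (2.21): map a scene point (y1, y2, y3) to image coordinates."""
    y1, y2, y3 = point
    return (b * y1 / y3, b * y2 / y3)

def image_orientation(start, end, b=12.0):
    """Orientation in degrees, in [0, 180), of the projected segment start-end."""
    x1, x2 = project(start, b)
    u1, u2 = project(end, b)
    return math.degrees(math.atan2(u2 - x2, u1 - x1)) % 180.0

def orientation_distance(phi_a, phi_b):
    """Smallest difference of two orientations modulo 180 degrees (assumed form)."""
    d = abs(phi_a - phi_b) % 180.0
    return min(d, 180.0 - d)

def parallelism_deviation(segment_a, segment_b, b=12.0):
    """Angle in degrees between the images of two 3D segments."""
    return orientation_distance(image_orientation(*segment_a, b=b),
                                image_orientation(*segment_b, b=b))

# Two pairs of parallel 3D segments about 1000 mm away (hypothetical values).
frontal = (((-50.0, -50.0, 1000.0), (-50.0, 50.0, 1000.0)),
           ((50.0, -50.0, 1000.0), (50.0, 50.0, 1000.0)))
receding = (((-50.0, -50.0, 1000.0), (-50.0, 50.0, 1100.0)),
            ((50.0, -50.0, 1000.0), (50.0, 50.0, 1100.0)))

print(parallelism_deviation(*frontal))   # segments parallel to the projection plane
print(parallelism_deviation(*receding))  # segments tilted in depth
```

Segments parallel to the projection plane stay parallel in the image, whereas segments that recede in depth converge slightly.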




Fig. 2.15. Arrangement of objects in a scene of electrical scrap. The black dummy
box is our target object for the purpose of demonstration. It is located in a complex
environment, is partially occluded, and has a protruding socket.



Definition 2.6 (Approximate parallel lines) Let $\delta_5$ be the
permissible deviation from parallelism, i.e. the maximal deviation from
exact regularity. Two image lines with values $\phi_1$ and $\phi_2$ of the
angle parameter are approximately parallel if

$$D_O(\phi_1, \phi_2) < \delta_5 \tag{2.22}$$

Assumption 2.4 (Parallelism compatibility) The parallelism compatibility
holds if parallel lines in 3D are approximately parallel after projective
transformation. For such imaging conditions, parallel lines in 3D occur as
peaks in the Hough image that are located within a horizontal stripe of
height $\delta_5$.



The peaks in a Hough stripe describe approximate parallel lines in the gray 
value image. Figure 2.16 shows the Hough image obtained from the gray value 
image in Figure 2.15 after binarization (edge detection) and application of 
the OHT. 




Fig. 2.16. Hough image obtained after edge detection in the image of Figure 2.15.
A set of the 55 most prominent peaks is marked by black dots. They have been
organized in 10 clusters (horizontal stripes) using the ISODATA clustering algorithm.



The Hough image has been annotated with black squares and horizontal lines,
which mark a set of 55 local maxima organized in 10 horizontal stripes. The
local maxima are obtained using the approach mentioned in Subsection 2.2.2.
For grouping the peak positions, we take only parameter $\phi$ into account
and use the distance function $D_O$ from equation (2.13). Accordingly, angles
near 0° can be grouped with angles near 180°. A procedure similar to the
error-based ISODATA clustering algorithm can be applied, but taking the
modified distance function into account [149, pp. 109-125]. Initially, the
algorithm groups vectors (in this application, simply scalars) by using the
standard K-means method. Then, clusters exhibiting large variances are split
in two, and clusters that are too close together are merged. Next, K-means
is reiterated taking the new clusters into account. This sequence is
repeated until no more clusters are split or merged. The merging/splitting
parameters are chosen in agreement with the pre-specified $\delta_5$ from
Assumption 2.4.
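The grouping of peak angles can be sketched as follows. This is a simplified stand-in for the ISODATA-like procedure, not the original implementation. The following are our assumptions: the modulo-180° distance, the doubled-angle mean, splitting by the farthest member instead of by variance, and the use of half the stripe height as both the splitting and the merging threshold.

```python
import math

def orientation_distance(a, b):
    """Difference of two orientations in degrees, taken modulo 180."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def circular_mean(angles):
    """Mean orientation modulo 180 degrees via the doubled-angle representation."""
    s = sum(math.sin(math.radians(2.0 * a)) for a in angles)
    c = sum(math.cos(math.radians(2.0 * a)) for a in angles)
    return (math.degrees(math.atan2(s, c)) / 2.0) % 180.0

def assign(angles, centers):
    """Assign every angle to its nearest center and drop empty clusters."""
    clusters = [[] for _ in centers]
    for a in angles:
        nearest = min(range(len(centers)),
                      key=lambda j: orientation_distance(a, centers[j]))
        clusters[nearest].append(a)
    return [members for members in clusters if members]

def kmeans(angles, centers, iterations=10):
    """Plain K-means on orientations; returns clusters and their mean centers."""
    clusters = []
    for _ in range(iterations):
        clusters = assign(angles, centers)
        centers = [circular_mean(members) for members in clusters]
    return clusters, centers

def cluster_angles(angles, delta, k_init=2, max_rounds=20):
    """Alternate K-means with split and merge steps until nothing changes."""
    centers = [i * 180.0 / k_init for i in range(k_init)]
    clusters = []
    for _ in range(max_rounds):
        clusters, centers = kmeans(angles, centers)
        changed = False
        split = []
        for members, center in zip(clusters, centers):
            far = max(members, key=lambda a: orientation_distance(a, center))
            split.append(center)
            if orientation_distance(far, center) > delta / 2.0:
                split.append(far)  # seed a second center at the farthest member
                changed = True
        merged = []
        for center in split:
            for i, other in enumerate(merged):
                if orientation_distance(center, other) < delta / 2.0:
                    merged[i] = circular_mean([center, other])
                    changed = True
                    break
            else:
                merged.append(center)
        centers = merged
        if not changed:
            break
    return clusters

peaks = [1.0, 3.0, 178.0, 179.0, 44.0, 46.0, 45.0, 90.0, 92.0, 88.0]
for stripe in cluster_angles(peaks, delta=10.0):
    print(sorted(stripe))
```

With the wrap-around distance, the angles near 0° and near 180° end up in the same stripe, which is the behavior described above.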

For example, Figure 2.17 shows the set of approximate parallel lines
specified by the Hough peaks in the fourth stripe, and Figure 2.18 shows the
set for the eighth stripe of Hough peaks. Among these lines we find
candidates for describing the boundary of the dummy box.




Fig. 2.17. The set of approximate parallel lines specified by the Hough peaks in 
the fourth stripe of Figure 2.16. 



The next subsection introduces a criterion for grouping approximate par- 
allel lines which are supposed to belong to the silhouette of an object in the 
image. It is the phase compatibility between approximate parallel lines and 
gray value ramps (PRPC). 



2.3.2 Phase Compatibility between Parallels and Ramps 

In a Fourier-transformed image the phase plays a much greater role in
exhibiting the relevant image structure than the amplitude [85]. It is a
global image feature in the sense of taking the whole image into account.
In contrast, the local phase is an image feature for characterizing image
structures locally [66, pp. 258-278]. In addition to describing a gray value
edge by the gradient angle and magnitude, it is interesting to distinguish
various types of edges, e.g. roof and ramp image structures. For example,
the boundary edges of a homogeneous surface can be classified as ramps, and
the edges of inscription strokes are more of the roof type. A further
distinction can be made as to whether the ramp runs from left to right or
vice versa, and whether the roof is directed to the top or the bottom. The
four special cases and all intermediate situations can be arranged on a unit
circle in the complex plane, and they are distinguished by the polar angle
$\varphi$. The specific values $\varphi := 0^\circ$ or $\varphi := 180^\circ$
represent the two roof types, and $\varphi := 90^\circ$ or
$\varphi := 270^\circ$ represent the two ramp types (see Figure 2.19).

Fig. 2.18. The set of approximate parallel lines specified by the Hough peaks in
the eighth stripe of Figure 2.16.




Fig. 2.19. Representation of local phase as a vector in the complex plane where
the argument reflects the roof and ramp relationship (evenness and oddness) of
edge characterization.

Local Phase Computation with Gabor Functions

For characterizing the type of gray value edges we can once more apply the
Gabor function, which has already been used in Section 2.2 for estimating
edge orientation. The Gabor function is a so-called analytic function, in
which the imaginary part is the Hilbert transform of the real part. The real
part is even and is tuned to respond to roof edges, and the imaginary part
is odd and is tuned to respond to ramp edges. The local phase of a
Gabor-transformed image reveals the edge type at a certain point, i.e. the
polar angle $\varphi$. However, the local phase is a one-dimensional
construct which should be determined along the direction of the gradient
angle (edge orientation). Accordingly, for characterizing an edge one must
first apply four rotated Gabor functions (0°, 45°, 90°, 135°) and determine
the edge orientation from the local amplitudes, and second apply another
Gabor function in the direction of the estimated edge orientation and
determine the edge type from the local phase.
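The second step, i.e. reading the edge type from the local phase along one direction, can be illustrated on a one-dimensional gray value profile. The sketch below is a simplification under our own assumptions: the kernel parameters (sigma, omega, radius) and the synthetic profile are hypothetical, and the orientation estimation with four rotated filters is omitted. Because of the discrete sampling and the small DC response of the even part, the ramp phases only come close to ±π/2.

```python
import numpy as np

def gabor_kernel(sigma, omega, radius):
    """Complex 1D Gabor kernel: even (cosine) real part, odd (sine) imaginary part."""
    k = np.arange(-radius, radius + 1)
    return np.exp(-(k ** 2) / (2.0 * sigma ** 2)) * np.exp(1j * omega * k)

def local_phase(profile, position, kernel):
    """Phase of the Gabor response of a 1D gray value profile at one position."""
    radius = (len(kernel) - 1) // 2
    window = profile[position - radius: position + radius + 1]
    return float(np.angle(np.sum(window * kernel)))

# Hypothetical profile taken across a bright region on a dark background.
profile = np.zeros(200)
profile[80:120] = 1.0
kernel = gabor_kernel(sigma=4.0, omega=np.pi / 4.0, radius=16)

print(local_phase(profile, 80, kernel))   # first bright sample (dark-to-bright ramp)
print(local_phase(profile, 119, kernel))  # last bright sample (bright-to-dark ramp)
print(local_phase(profile, 100, kernel))  # interior of the homogeneous region
```

The two ramps yield phases of opposite sign, while the phase inside the homogeneous region is close to 0.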

Experiments on Local Phase Computation

For example, let us assume a simulated image with a region $R_1$ of
homogeneous gray value $g_1$ and a surrounding region $R_2$ of homogeneous
gray value $g_2$. If the gradient angles (in the interval
$[0^\circ, \cdots, 360^\circ]$) at the boundary points of $R_1$ are
considered for computing the local phases in the relevant directions, then a
constant local phase value $\frac{\pi}{2}$ is obtained for every point of
the boundary. However, we restricted the computation of edge orientation to
the interval $[0^\circ, \cdots, 180^\circ]$, and as a consequence the local
phase takes on either $+\frac{\pi}{2}$ or $-\frac{\pi}{2}$. Specifically, if
region $R_1$ is a parallelogram, then the local phases for pairs of points
taken from opposite parallel lines have different signs.

The change of sign in the local phase can also be observed for pairs of
approximate parallel boundary lines from the silhouette of the dummy box.
Figure 2.20 illustrates two similar cases in the left and right column. The
top picture shows two parallel boundary lines which are taken as examples
from the picture in Figure 2.6 (bottom). In the middle, two short lines are
depicted which cross the boundary lines and are approximately orthogonal to
them. The local phases are computed (based on edge orientation) for the set
of points from these short lines instead of just the one boundary point.
This gives an impression of the sensitivity of local phase computation in
the neighborhood of the relevant point.* The bottom diagram shows the two
courses of local phases computed for the two sets of line points from the
middle picture. The local phases of one line are located in the interval
$[0, \cdots, \pi]$ and those of the other line in the interval
$[-\pi, \cdots, 0]$.

* Log-normal filters are proposed to treat this problem [66, pp. 219-288].

Fig. 2.20. Dummy box, two pairs of parallel lines, phase computation.



In Figure 2.21, the picture at the top again contains the dummy box, an
artificial vertical line segment through the box region, and three dots where
the line segment crosses three boundary lines (the boundary lines are not
depicted). The diagram at the bottom shows the course of local phases for all
points of the line segment from top to bottom. Furthermore, the mentioned
crossing points in the picture are indicated in the diagram at specific
positions of the coordinate axis (see the positions of the vertical lines in
the diagram). The local phase at the first and second crossing point is
negative, and at the third point positive, and the values are approximately
$-\frac{\pi}{2}$ and $+\frac{\pi}{2}$, respectively. Away from these specific
points, i.e. in nearly homogeneous image regions, the local phases are
approximately 0.

Local Phase Computations at Opposite Boundary Lines 

Fig. 2.21. Dummy box, vertical line and crossing points, course of local phases.

Based on this observation, it is possible to formulate a necessary criterion
for grouping silhouette lines. Two approximate parallel line segments belong
to the silhouette of an object if the gray value ramps at the first line
segment are converse to the gray value ramps at the second line segment. It
is presupposed that local phases are determined along the direction of the
gradient of gray value edges. Furthermore, the gray values of the object
must be generally higher or generally lower than the gray values of the
background. Let $\mathcal{L}_1$ and $\mathcal{L}_2$ be two approximate
parallel line segments. For all points of $\mathcal{L}_1$ we compute the
local phases and take the mean, designated by the function application
$f^{ph}(\mathcal{L}_1)$, and the same procedure is repeated for
$\mathcal{L}_2$, which is designated by $f^{ph}(\mathcal{L}_2)$. Then, we
define a similarity measure between phases such that the similarity between
equal phases is 1, and the similarity between phases with opposite
directions in the complex plane is 0 (see Figure 2.19).

$$D_{PR}(\mathcal{L}_1, \mathcal{L}_2) := 1 - \frac{|f^{ph}(\mathcal{L}_1) - f^{ph}(\mathcal{L}_2)|}{\pi} \tag{2.23}$$
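The similarity measure of equation (2.23) can be sketched in a few lines. The sketch assumes that the measure has the form one minus the absolute difference of the mean phases divided by π, which reproduces the values 1 for equal and 0 for opposite phases stated above. The function names are ours, and the plain arithmetic mean ignores phase wrap-around, which is harmless for phases near ±π/2 and 0.

```python
import math

def mean_phase(local_phases):
    """Mean of the local phases (radians) sampled along one line segment."""
    return sum(local_phases) / len(local_phases)

def phase_similarity(phases_1, phases_2):
    """Similarity of two segments: 1 for equal mean phases, 0 for opposite ones."""
    return 1.0 - abs(mean_phase(phases_1) - mean_phase(phases_2)) / math.pi

half_pi = math.pi / 2.0
print(phase_similarity([-half_pi] * 5, [half_pi] * 5))  # opposite ramps
print(phase_similarity([half_pi] * 5, [half_pi] * 5))   # equal ramps
print(phase_similarity([half_pi] * 5, [0.0] * 5))       # ramp versus homogeneous interior
```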
The usefulness of equation (2.23) can be illustrated by various examples of
line segment pairs in a simulated image.* The image may consist of a black
rectangle $R_1$ and a white environment $R_2$. For two line segments taken
from opposite sides of $R_1$ the mean phases are $-\frac{\pi}{2}$ and
$+\frac{\pi}{2}$, respectively, and therefore $D_{PR} = 0$. For two line
segments taken from one side of $R_1$ the mean phases are equal, i.e.
$-\frac{\pi}{2}$ or $+\frac{\pi}{2}$, and therefore $D_{PR} = 1$. For two
line segments taken from the interior of region $R_1$ the mean phases are
both 0, and therefore again $D_{PR} = 1$. If one segment is taken from one
side of $R_1$, i.e. its mean phase is $-\frac{\pi}{2}$ or $+\frac{\pi}{2}$,
and the other line segment is taken from the interior of region $R_1$, i.e.
its mean phase is 0, then $D_{PR} = 0.5$. According to this discussion, for
line segments taken from opposite sides of a silhouette the measure yields
0, and in the other cases the measure is larger than 0. Generally, the value
of measure $D_{PR}$ is restricted to the unit interval. Based on this
definition, we can formally introduce a phase compatibility between
approximate parallel lines and gray value ramps.

Assumption 2.5 (Parallel/ramp phase compatibility, PRPC) It is assumed that
two line segments $\mathcal{L}_1$ and $\mathcal{L}_2$ are approximately
parallel according to equation (2.22). Furthermore, the gray values between
the two segments should be generally higher or alternatively lower than
beyond the segments. Let $\delta_6$ be a threshold for the necessary
parallel/ramp phase compatibility. The parallel/ramp phase compatibility
(PRPC) holds between a pair of approximate parallel lines if

$$D_{PR}(\mathcal{L}_1, \mathcal{L}_2) < \delta_6 \tag{2.24}$$

Appropriate task-relevant thresholds $\delta_5$ and $\delta_6$ can be
determined on the basis of visual demonstration (see Section 2.5).**

In the next sections, the horizontal clusters of Hough peaks are used in
combination with the LEOC, PCJC, and PRPC principles for extracting the
faces or silhouettes of objects (or merely approximate faces or silhouettes)
from the images.

2.3.3 Extraction of Regular Quadrangles

In a multitude of man-made objects the faces or the silhouettes can be
approximated by squares, rectangles, or trapezoids. The projective
transformation of these shapes yields approximations of squares, rhombuses,
rectangles, parallelograms, or trapezoids. A generic procedure will be
presented for extracting these specific quadrangles from the image.*** Under
the constraint of clustered Hough peaks, we exhaustively look for quadruples
of Hough peaks, extract the four line segments by line intersection, apply
the LEOC, PCJC, and PRPC principles, and determine deviations from a certain
standard form. For each quadrangle all evaluations are combined, which
results in a saliency value including both photometric and geometric aspects.

* The extraction of line segments from lines in real images is treated
afterwards.

** In Section 2.5, we present an approach of local phase estimation which
becomes more robust if phase computation is extended to a small environment
near the line segment. Considering the environmental points reduces the
sensitivity of local phase computation (see above).

*** The subsection after this one presents a procedure for extracting more
general polygons.

Geometric/Photometric Compatibility for Quadrangles 

We have introduced the orientation-deviation $D_{LE}$ related to a line
segment in equation (2.12) and the junction-deviation $D_{PC}$ related to a
pencil of line segments in equation (2.16). To extend the LEOC and PCJC
principles to quadrangles, we simply average these values over the four
segments and over the four pencils, respectively.

$$D_{LE\_QD} := \frac{1}{4} \sum_{i=1}^{4} D_{LE\_i}, \qquad D_{PC\_QD} := \frac{1}{4} \sum_{i=1}^{4} D_{PC\_i} \tag{2.25}$$

For convenience, we omitted the parameters and simply introduced an index
for the line segments involved in a quadrangle. The resulting functions
$D_{LE\_QD}$ and $D_{PC\_QD}$ can be used in combination to define a
geometric/photometric compatibility for quadrangles.

Assumption 2.6 (Geometric/photometric compatibility for a quadrangle) The
necessary geometric/photometric compatibility for a quadrangle is specified
by parameter $\delta_7$. The geometric/photometric compatibility for a
quadrangle holds if

$$(D_{LE\_QD} + D_{PC\_QD}) < \delta_7 \tag{2.26}$$

For specific quadrangle shapes with one or two pairs of approximate parallel
lines, the PRPC principle can be included in Assumption 2.6. The left-hand
side of equation (2.26) is extended by a further summation term $D_{PR\_QD}$
which describes the parallel/ramp phase compatibility related to a specific
quadrangle having approximate parallel lines.

Geometric Deviation of Quadrangles from Standard Forms

In order to consider the purely geometric aspect, we define the deviation of
a quadrangle from certain standard forms. For the sequence of four line
segments of a quadrangle, let $\mathcal{H}$ be the tuple of the four segment
lengths, $\mathcal{G} := (\gamma_1, \gamma_2, \gamma_3, \gamma_4)$ be the
inner angles of two successive segments, and
$\mathcal{T} := (\phi_1, \phi_2, \phi_3, \phi_4)$ be the orientation angles
of the polar form representations.

Definition 2.7 (Rectangle-deviation, parallelogram-deviation,
square-deviation, rhombus-deviation, trapezoid-deviation)

The rectangle-deviation of a quadrangle is defined by

$$D_{RC}(\mathcal{G}) := \ldots \cdot \sum_{i=1}^{4} |\gamma_i - 90^\circ| \tag{2.27}$$

The parallelogram-deviation of a quadrangle is defined by

$$D_{PA}(\mathcal{G}) := \ldots \cdot (|\gamma_1 - \gamma_3| + |\gamma_2 - \gamma_4|) \tag{2.28}$$

The square-deviation of a quadrangle is defined by

$$D_{SQ}(\mathcal{G}, \mathcal{H}) := \frac{1}{2} \cdot (D_{RC}(\mathcal{G}) + V_{SL}(\mathcal{H})) \tag{2.29}$$

with the normalized length variance $V_{SL}(\mathcal{H})$ of the four line segments.

The rhombus-deviation of a quadrangle is defined by

$$D_{RH}(\mathcal{G}, \mathcal{H}) := \frac{1}{2} \cdot (D_{PA}(\mathcal{G}) + V_{SL}(\mathcal{H})) \tag{2.30}$$

The trapezoid-deviation of a quadrangle is defined by

$$D_{TR}(\mathcal{T}) := \min\{D_O(\phi_1, \phi_3), D_O(\phi_2, \phi_4)\} \tag{2.31}$$



Normalization factors are chosen such that the possible values of each func- 
tion fall in the unit interval, respectively. For the rectangle-deviation the mean 
deviation from right-angles is computed for the inner angles of the quad- 
rangle. For the parallelogram-deviation the mean difference between diago- 
nally opposite inner angles is computed. The square-deviation and rhombus- 
deviation are based on the former definitions and additionally include the 
variance of the lengths of line segments. The trapezoid-deviation is based on 
equation (2.13) and computes for the two pairs of diagonally opposite line 
segments the minimum of deviation from parallelism. 
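The deviation measures of Definition 2.7 can be sketched as follows. The normalization factors are our assumptions, chosen only so that the values of convex quadrangles stay in the unit interval: division by 90° for the rectangle-deviation, by 180° for the parallelogram-deviation, and by 90° for the trapezoid-deviation. The form of the normalized length variance is likewise hypothetical, as is the modulo-180° orientation distance.

```python
def orientation_distance(phi_a, phi_b):
    """Smallest difference of two orientations modulo 180 degrees (assumed form)."""
    d = abs(phi_a - phi_b) % 180.0
    return min(d, 180.0 - d)

def rectangle_deviation(gammas):
    """Mean deviation of the inner angles from 90 degrees, scaled by 90 (assumed)."""
    return sum(abs(g - 90.0) for g in gammas) / (len(gammas) * 90.0)

def parallelogram_deviation(gammas):
    """Mean difference of diagonally opposite inner angles, scaled by 180 (assumed)."""
    g1, g2, g3, g4 = gammas
    return (abs(g1 - g3) + abs(g2 - g4)) / (2.0 * 180.0)

def length_variance(lengths):
    """Variance of the lengths divided by its maximum for the given mean (assumed)."""
    n = len(lengths)
    mean = sum(lengths) / n
    variance = sum((s - mean) ** 2 for s in lengths) / n
    return variance / ((n - 1) * mean ** 2)

def square_deviation(gammas, lengths):
    return 0.5 * (rectangle_deviation(gammas) + length_variance(lengths))

def rhombus_deviation(gammas, lengths):
    return 0.5 * (parallelogram_deviation(gammas) + length_variance(lengths))

def trapezoid_deviation(phis):
    """Smaller deviation from parallelism of the two pairs of opposite segments."""
    p1, p2, p3, p4 = phis
    return min(orientation_distance(p1, p3), orientation_distance(p2, p4)) / 90.0

skewed = (60.0, 120.0, 60.0, 120.0)           # inner angles of a parallelogram
print(rectangle_deviation(skewed))            # clearly not a rectangle
print(parallelogram_deviation(skewed))        # exact parallelogram
print(rhombus_deviation(skewed, (3.0, 3.0, 3.0, 3.0)))
print(trapezoid_deviation((10.0, 80.0, 25.0, 110.0)))
```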

The features related to the geometric/photometric compatibility for quad- 
rangles, and the feature related to the geometric deviation of quadrangles 
from specific shapes must be combined to give measures of conspicuity of 
certain shapes in an image. 

Definition 2.8 (Saliency of specific quadrangles) The saliency of a specific
quadrangle is defined by

$$A_{SP\_QD} := 1 - \frac{D_{QD}}{\omega_1 + \omega_2 + \omega_3 + \omega_4} \tag{2.32}$$

with

$$D_{QD} := \omega_1 \cdot D_{LE\_QD} + \omega_2 \cdot D_{PC\_QD} + \omega_3 \cdot D_{PR\_QD} + \omega_4 \cdot D_{SP\_QD} \tag{2.33}$$

The specific quadrangle can be an approximate rectangle, parallelogram,
square, rhombus, or trapezoid, and for these cases the generic function symbol
$D_{SP\_QD}$ must be replaced by $D_{RC}$, $D_{PA}$, $D_{SQ}$, $D_{RH}$, or
$D_{TR}$, as introduced in Definition 2.7.

Generic Procedure $PE_1$ for the Extraction of Specific Quadrangles

Procedure $PE_1$

1. For each pair of cluster stripes in the set of Hough peaks:

1.1. For each pair of Hough peaks in the first stripe:

1.1.1. For each pair of Hough peaks in the second stripe:

1.1.1.1. Intersect the lines specified by the four Hough peaks and
construct the quadrangle.

1.1.1.2. Compute the mean line/edge orientation-deviation using
function $D_{LE\_QD}$.

1.1.1.3. Compute the mean pencil/corner junction-deviation using
function $D_{PC\_QD}$.

1.1.1.4. Compute the mean parallel/ramp phase-deviation using
function $D_{PR\_QD}$.

1.1.1.5. Compute the deviation from the specific quadrangle using
function $D_{SP\_QD}$.

1.1.1.6. Compute the saliency value by combining the above results
according to equation (2.32).

2. Bring the specific quadrangles into order according to decreasing
saliency values.



The generic procedure works for all types of specific quadrangles mentioned
above, except for trapezoids. For the extraction of trapezoids the algorithm
can be modified such that it iterates over single cluster stripes, selects
all combinations of two Hough peaks in each stripe, and takes the third and
fourth Hough peaks from any other cluster stripes. This takes into account
that a trapezoid has just one pair of parallel line segments, instead of two
such pairs as in the other specific quadrangles.
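The iteration scheme of the generic procedure for quadrangles (not the trapezoid variant) can be sketched as follows. It is an illustration under stated assumptions: Hough peaks are taken to be pairs (r, phi) of a line in normal form, the stripes are given as lists of such peaks, and the hypothetical callback `evaluate` stands in for the four deviation measures, which depend on image data not modeled here.

```python
import itertools
import math

def intersect(line_a, line_b):
    """Intersect two lines given as (r, phi in degrees): x*cos(phi) + y*sin(phi) = r."""
    (r1, phi1), (r2, phi2) = line_a, line_b
    a1, b1 = math.cos(math.radians(phi1)), math.sin(math.radians(phi1))
    a2, b2 = math.cos(math.radians(phi2)), math.sin(math.radians(phi2))
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None  # (nearly) parallel lines have no usable intersection
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def saliency(deviations, weights):
    """One minus the weighted mean of the deviations, as in (2.32) and (2.33)."""
    weighted = sum(w * d for w, d in zip(weights, deviations))
    return 1.0 - weighted / sum(weights)

def extract_quadrangles(stripes, evaluate, weights=(0.25, 0.25, 0.25, 0.25)):
    """Enumerate quadrangles from pairs of stripes and sort them by saliency."""
    results = []
    for stripe_a, stripe_b in itertools.combinations(stripes, 2):   # step 1
        for a1, a2 in itertools.combinations(stripe_a, 2):          # step 1.1
            for b1, b2 in itertools.combinations(stripe_b, 2):      # step 1.1.1
                corners = [intersect(a1, b1), intersect(a1, b2),
                           intersect(a2, b2), intersect(a2, b1)]
                if any(c is None for c in corners):
                    continue
                deviations = evaluate(corners)  # 4-tuple of deviation values
                results.append((saliency(deviations, weights), corners))
    results.sort(key=lambda item: item[0], reverse=True)            # step 2
    return results

# Two stripes: three vertical lines and two horizontal lines (hypothetical peaks).
stripes = [[(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)], [(0.0, 90.0), (5.0, 90.0)]]
for value, corners in extract_quadrangles(stripes, lambda c: (0.0, 0.0, 0.0, 0.0)):
    print(value, corners)
```

The number of candidates grows with the product of the pair counts in both stripes, which is why the clustering into stripes is used as a constraint before the exhaustive search.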

Experiments on the Extraction of Regular Quadrangles

These procedures have been applied to complicated scenes of electrical scrap
in order to draw conclusions concerning the usefulness of the principles
introduced above. The goal of the following experiments was to extract from
the images specific quadrangles which describe the faces or silhouettes of
objects.

First, we applied the procedure to the image in Figure 2.15 with the
intention of extracting approximate parallelograms. In the saliency measure,
the weighting factors $\omega_1, \omega_2, \omega_3, \omega_4$ are all set to
0.25. Figure 2.22 shows, as an example, a set of 65 approximate
parallelograms which are best according to the saliency measure. The
decreasing course of saliency values for the approximate parallelograms is
depicted in Figure 2.23. The silhouette boundary of the dummy box (see
Figure 2.24) is included as number three. For automatically detecting the
dummy box, i.e. determining the reference number three as the relevant one,
it is necessary to apply object recognition. In Chapter 3, we present
approaches for object recognition, which can be applied within the areas of
a certain set of most salient quadrangles.

Second, the procedure for boundary extraction is used to extract approximate
rectangles with the intention of extracting the electronic board of a
computer interior in Figure 2.25. The best set of 10 approximate rectangles
is outlined in black, including the relevant one of the board.

Third, the procedure for boundary extraction was applied to extract
approximate rhombuses with the goal of locating the electronic board in the
image of Figure 2.26. The best set of 3 approximate rhombuses is outlined in
white, including the relevant one.

Finally, the extraction of approximate trapezoids is shown in Figure 2.27. 




Fig. 2.22. Based on a saliency measure a subset of most conspicuous, approximate
parallelograms has been extracted.

Fig. 2.23. Decreasing course of saliency values for the subset of extracted
parallelograms in the image of Figure 2.22.




Fig. 2.24. The silhouette boundary of the dummy box is included in the set of
most conspicuous, approximate parallelograms.

Fig. 2.25. Electronic board of a computer interior and an extracted subset of
approximate rectangles. One of these rectangles represents the boundary of the
board.




Fig. 2.26. Computer interior containing an electronic board. One of the extracted
approximate rhombuses represents the boundary of the board.

Fig. 2.27. Image of a loudspeaker and a subset of extracted approximate
trapezoids. One of these represents a side-face of the loudspeaker.



2.3.4 Extraction of Regular Polygons 

The measures $D_{LE}$ and $D_{PC}$ involved in the geometric/photometric
compatibility of single line segments can easily be extended to polygons of
$K$ line segments.

$$D_{LE\_PG} := \frac{1}{K} \sum_{i=1}^{K} D_{LE\_i}, \qquad D_{PC\_PG} := \frac{1}{K} \sum_{i=1}^{K} D_{PC\_i} \tag{2.34}$$

The number $K$ is set a priori and may be known from the task specification.

Assumption 2.7 (Geometric/photometric compatibility for a polygon) The
necessary geometric/photometric compatibility for a polygon is specified by
parameter $\delta_8$. The geometric/photometric compatibility for a polygon
holds if

$$(D_{LE\_PG} + D_{PC\_PG}) < \delta_8 \tag{2.35}$$

These compatibility features must be combined with purely geometric features
of the polygon. By considering specific polygon regularities, a measure of
conspicuity is obtained, i.e. a saliency value for the polygon. Regularities
that are more complex than the simple ones involved in specific quadrangles
are of interest here. For polygons with an arbitrary number of segments, we
define three types of regularities, which are general in the sense that they
typically appear in scenes of man-made objects. Two of these types are based
on symmetries between polylines, i.e. the reflection-symmetry and the
translation-symmetry. A reflection-symmetry is a pair of polylines in which
each one can be obtained by reflecting the other one at an axis. A
translation-symmetry is a pair of polylines in which each one can be
obtained by reflecting the other one sequentially at two parallel axes
(Figure 2.28). Examples of approximate reflection-symmetric and
translation-symmetric polylines appear in the image of a computer monitor
(see the two polygons outlined in white in Figure 2.29).




Fig. 2.28. Constructing a translation of polylines by a two-step reflection at two 
parallel axes. 




Fig. 2.29. Computer monitor with two types of regularities of the polygonal faces.
The side face is approximate reflection-symmetric, and the top face is approximate
translation-symmetric.

The third type of regularity is a right-angle, i.e. every two successive
line segments of a polygon are right-angled. For example, Figure 2.30 shows
an electronic board with approximate right-angled shape.




Fig. 2.30. Computer interior with an electronic board, and a hexagonal boundary. 
The pencils of lines of the board hexagon are approximate right-angles. 



Regularities of Polygons 

The basic component for describing polygon regularities is a polyline. It is
a non-closed and non-branching sequence of connected line segments. We
construct for each polygon a pair of non-overlapping polylines with equal
numbers of line segments. A polygon with an odd number of line segments is
the union of two polylines and one single line segment (see Figure 2.31, top).
For a polygon with an even number of line segments we distinguish two cases.
First, the polygon can be the union of two polylines, i.e. they meet at two
polygon junctions (see Figure 2.31, middle). Second, the polygon can be the
union of two polylines and of two single line segments, each located at one
end of the polylines (see Figure 2.31, bottom).

Let $\mathcal{G} := (\gamma_1, \cdots, \gamma_K)$ be the ordered sequence of
inner angles of the polygon. A candidate pair of polylines is represented by
$\mathcal{G}^1 := (\gamma^1_1, \cdots, \gamma^1_k)$ and
$\mathcal{G}^2 := (\gamma^2_1, \cdots, \gamma^2_k)$, i.e. the sequence of
inner angles related to the first polyline and the opposite (corresponding)
inner angles related to the second polyline. All angles $\gamma^1_i$ or
$\gamma^2_i$ are contained in $\mathcal{G}$. Included in $\mathcal{G}^1$ and
$\mathcal{G}^2$ are the inner angles at the ends of the polylines, where one
polyline meets the other one or meets a single line segment. There are
different candidate pairs of polylines of a polygon, i.e. the pair of angle
sequences $(\mathcal{G}^1, \mathcal{G}^2)$ is just one element in a whole
set which we designate by $\mathcal{G}_S$. For example, Figure 2.31 (bottom)
shows for one candidate pair of polylines the two corresponding sequences of
inner polygon angles.

Fig. 2.31. Organization of reflection-symmetric polygons by two polylines with
equal number of line segments and up to two single line segments.

Definition 2.9 (Reflection-symmetric polygon) A polygon is
reflection-symmetric if a pair of polylines exists with sequences
$\mathcal{G}^1$ and $\mathcal{G}^2$ of inner angles such that
$d_{RS}(\gamma^1_i, \gamma^2_i) = 0^\circ$ for each tuple
$(\gamma^1_i, \gamma^2_i)$, $i \in \{1, \cdots, k\}$, with

$$d_{RS}(\gamma^1_i, \gamma^2_i) := |\gamma^1_i - \gamma^2_i| \tag{2.36}$$

Figure 2.31 shows three reflection-symmetric polygons (left column) and the
relevant configuration of polylines and single line segments (right column).
Obviously, for all three polygons there exists a vertical axis of reflection
for mapping one polyline onto the other. As an example, in the bottom
polygon the following equations hold:

$$\gamma^1_1 = \gamma^2_1, \quad \gamma^1_2 = \gamma^2_2, \quad \gamma^1_3 = \gamma^2_3 \tag{2.37}$$

Definition 2.10 (Translation-symmetric polygon) A polygon is
translation-symmetric if a pair of polylines exists with sequences
$\mathcal{G}^1$ and $\mathcal{G}^2$ of inner angles such that
$d_{TS}(\gamma^1_i, \gamma^2_i) = 0^\circ$ for each tuple
$(\gamma^1_i, \gamma^2_i)$, $i \in \{1, \cdots, k\}$, with

$$d_{TS}(\gamma^1_i, \gamma^2_i) := \begin{cases} |\gamma^1_i - (360^\circ - \gamma^2_i)| & : i \in \{2, \cdots, k-1\} \\ |\gamma^1_i - (180^\circ - \gamma^2_i)| & : i \in \{1, k\} \end{cases} \tag{2.38}$$



For the translation-symmetry, it is plausible to match the inner polygon
angles ($\gamma^1_i$) of the first polyline with the corresponding exterior
angles ($360^\circ - \gamma^2_i$) of the second polyline. The corresponding
angles at the ends of the two polylines must be matched modulo 180°.
Figure 2.32 shows a hexagon with translation-symmetry and the relevant pair
of polylines. We can imagine a translation vector for mapping one polyline
onto the other. The property of translation-symmetry is easily verified by

$$\gamma^1_1 = (180^\circ - \gamma^2_1), \quad \gamma^1_2 = (360^\circ - \gamma^2_2), \quad \gamma^1_3 = (180^\circ - \gamma^2_3) \tag{2.39}$$




Fig. 2.32. Organization of a translation-symmetric hexagon by two polylines and 
two single line segments. 



For a polygon the property of reflection-symmetry or translation-symmetry
is examined by determining whether there is a pair of polylines for which
the proposition in Definition 2.9 or 2.10 holds, respectively. For this
purpose, one has to evaluate all possible pairs of polylines by applying a
parallel or sequential algorithm. If there is no relevant pair, then the
polygon is not regular concerning reflection- or translation-symmetry.
Figure 2.33 shows two polygons with inappropriate pairs of polylines,
although the reflection-symmetry and the translation-symmetry have already
been verified for these polygons in Figure 2.31 (bottom) and Figure 2.32,
respectively.

For a polygon the deviation from reflection-symmetry and translation-symmetry,
respectively, is defined by matching, for all candidate pairs of polylines,
the two sequences of inner angles and taking the minimum value.

Fig. 2.33. Polygons with inappropriate organizations of pairs of polylines. The
reflection-symmetry and the translation-symmetry have been verified based on
appropriate organizations in Figure 2.31 (bottom) and Figure 2.32, respectively.



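Definition 2.11 (Deviation from reflection-symmetry or translation-symmetry)

For a polygon the deviation from reflection-symmetry is

$$D_{RS} := \min_{(\mathcal{G}^1, \mathcal{G}^2) \in \mathcal{G}_S} \left\{ \frac{1}{k \cdot 360^\circ} \cdot \sum_{i=1}^{k} d_{RS}(\gamma^1_i, \gamma^2_i) \right\} \tag{2.40}$$

For a polygon the deviation from translation-symmetry is

$$D_{TS} := \min_{(\mathcal{G}^1, \mathcal{G}^2) \in \mathcal{G}_S} \left\{ \frac{1}{k \cdot 360^\circ} \cdot \sum_{i=1}^{k} d_{TS}(\gamma^1_i, \gamma^2_i) \right\} \tag{2.41}$$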



For the reflection-symmetric polygons in Figure 2.31 we obtain $D_{RS} = 0$,
and for the translation-symmetric polygon in Figure 2.32 we obtain
$D_{TS} = 0$. In contrast, for the left face of the computer monitor in
Figure 2.29 we compute $D_{RS} \approx 0.064$, which indicates an approximate
reflection-symmetry. For the top face of the computer monitor we compute
$D_{TS} \approx 0.015$, which indicates an approximate translation-symmetry.
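The two symmetry deviations can be sketched as follows, assuming that each is the minimum over all candidate pairs of the summed angle mismatch divided by k times 360°. The enumeration of candidate pairs of polylines is not shown: the hypothetical argument `candidate_pairs` is expected to hold the pairs of corresponding inner-angle sequences in degrees, and the angle values below are made up for illustration.

```python
def d_rs(gamma_1, gamma_2):
    """Mismatch of two corresponding inner angles under reflection."""
    return abs(gamma_1 - gamma_2)

def d_ts(gamma_1, gamma_2, at_end):
    """Mismatch under translation: exterior angle inside, modulo 180 at the ends."""
    reference = 180.0 if at_end else 360.0
    return abs(gamma_1 - (reference - gamma_2))

def reflection_deviation(candidate_pairs):
    """Minimum normalized reflection mismatch over all candidate pairs."""
    return min(sum(d_rs(a, b) for a, b in zip(g1, g2)) / (len(g1) * 360.0)
               for g1, g2 in candidate_pairs)

def translation_deviation(candidate_pairs):
    """Minimum normalized translation mismatch over all candidate pairs."""
    def mismatch(g1, g2):
        k = len(g1)
        total = sum(d_ts(a, b, at_end=(i in (0, k - 1)))
                    for i, (a, b) in enumerate(zip(g1, g2)))
        return total / (k * 360.0)
    return min(mismatch(g1, g2) for g1, g2 in candidate_pairs)

# One candidate pair each; the second example is slightly distorted.
exact = [((60.0, 120.0, 60.0), (120.0, 240.0, 120.0))]
distorted = [((60.0, 120.0, 60.0), (120.0, 240.0, 102.0))]
print(translation_deviation(exact))
print(translation_deviation(distorted))
print(reflection_deviation([((120.0, 90.0, 100.0), (118.0, 92.0, 100.0))]))
```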



Definition 2.12 (Approximate reflection-symmetric or approximate
translation-symmetric polygons) Let $\delta_9$ and $\delta_{10}$ be the
permissible deviations from exact reflection- and translation-symmetry,
respectively. A polygon is approximate reflection-symmetric if
$D_{RS} < \delta_9$. A polygon is approximate translation-symmetric if
$D_{TS} < \delta_{10}$.








Finally, we consider the right-angled polygon as a third type of typical
regularity in man-made objects. In the special case of convex right-angled
polygons, the shape is a rectangle. In general, the polygon can include
concavities with the inner polygon angle 270°.

Definition 2.13 (Right-angled polygon, right-angle deviation) A polygon is
right-angled if for every pencil of two line segments the inner polygon
angle $\gamma_i$ satisfies $d_{RA}(\gamma_i) = 0^\circ$, with

$$d_{RA}(\gamma_i) := \min\{|\gamma_i - 90^\circ|, |\gamma_i - 270^\circ|\} \tag{2.42}$$

The right-angle deviation of a polygon with $M$ pencils of two line
segments, i.e. a polygon with $M$ line segments, is defined by

$$D_{RA}(\mathcal{G}) := \frac{1}{M \cdot 360^\circ} \cdot \sum_{i=1}^{M} d_{RA}(\gamma_i) \tag{2.43}$$

Definition 2.14 (Approximate right-angled polygon) Let $\delta_{11}$ be the
permissible deviation from right-angled polygons. A polygon is approximate
right-angled if $D_{RA} < \delta_{11}$.
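The right-angle deviation can be sketched as follows. The normalization by M times 360° is our reading of the definition and should be treated as an assumption, and the function names are ours. The default threshold 0.1 is the value reported for the experiments.

```python
def d_ra(gamma):
    """Deviation of one inner angle (degrees) from the nearest of 90 and 270."""
    return min(abs(gamma - 90.0), abs(gamma - 270.0))

def right_angle_deviation(gammas):
    """Mean right-angle deviation of all inner angles, scaled by 360 (assumed)."""
    return sum(d_ra(g) for g in gammas) / (len(gammas) * 360.0)

def is_approximately_right_angled(gammas, delta_11=0.1):
    return right_angle_deviation(gammas) < delta_11

l_shape = (90.0, 90.0, 90.0, 270.0, 90.0, 90.0)  # hexagon with one concavity
print(right_angle_deviation(l_shape))
print(right_angle_deviation((80.0, 100.0, 90.0, 90.0)))
print(is_approximately_right_angled((45.0, 135.0, 45.0, 135.0)))
```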

Based on these definitions, we introduce three compatibilities under
projective transformation. For the imaging conditions in our experiments a
threshold value $\delta_9 = \delta_{10} = \delta_{11} = 0.1$ proved
appropriate. The compatibilities will be used later on for extracting
regular polygons from the image.

Assumption 2.8 (Reflection-symmetry compatibility) The reflection-symmetry
compatibility for a polygon holds if a reflection-symmetric 3D polygon is
approximate reflection-symmetric after projective transformation.

Assumption 2.9 (Translation-symmetry compatibility) The translation-symmetry
compatibility for a polygon holds if a translation-symmetric 3D polygon is
approximate translation-symmetric after projective transformation.

Assumption 2.10 (Right-angle compatibility) The right-angle compatibility
for a polygon holds if a right-angled 3D polygon is approximate right-angled
after projective transformation.

The projective transformation of 3D object faces, which are supposed 
to be reflection-symmetric, translation-symmetric or right-angled polygons, 
yields approximations of these specific polygons in the image. The features 
related to the geometric deviation of polygons from these specific shapes must 
be combined with features related to the geometric/photometric compatibil- 
ity for polygons. This gives a measure of conspicuity of specific polygons in 
an image. 
Definition 2.15 (Saliency of specific polygons) The saliency of a specific
polygon is defined by

$A_{SP\_PG} := 1 - \frac{D_{PG}}{\omega_1 + \omega_2 + \omega_3 + \omega_4}$, with (2.44)

$D_{PG} := \omega_1 \cdot D_{LE\_PG} + \omega_2 \cdot D_{PC\_PG} + \omega_3 \cdot D_{PR\_PG} + \omega_4 \cdot D_{SP\_PG}$ (2.45)



The function symbol $D_{SP\_PG}$ must be replaced by $D_{RS}$, $D_{TS}$, or $D_{RA}$,
depending on whether there is interest in approximate reflection-symmetric,
translation-symmetric or right-angled polygons.

Especially in the case of reflection-symmetric polygons it makes sense to
apply the principle of parallel/ramp phase compatibility, which is included
in equation (2.44) by the term $D_{PR\_PG}$. For other shapes the relevant
parameter $\omega_3$ is set to 0.

Generic Procedure PE 2 for the Extraction of Specific Polygons 



Procedure PE 2 



1. From the whole set of combinations of three Hough peaks: 

1.1. Select just the combinations under the constraint that first and 
third Hough peak don’t belong to the same cluster as the second 
peak. 

1.2. Determine for each combination a line segment by intersecting 
first and third line with the second one (specified by the Hough 
peaks, respectively). 

1.3. Select the line segments, which are completely contained in the 
image, and are not isolated. 

1.4. Compute the line/edge orientation-deviation using function $D_{LE}$,
and the pencil/corner junction-deviation using function $D_{PC}$,
and select those line segments for which both the LEOC and
PCJC principles hold.

2. Compute a graph representing the neighborhood of line segments, 
i.e. create a knot for each intersection point and an arc for each 
line segment. 

3. Compute the set of minimal, planar cycles in the graph, i.e. cycles
with a minimal number of knots which are not intersected by any
arc of the graph. This gives a candidate set of polygons representing
faces of an object.



4. For each polygon: 

4.1. Compute the mean line/edge orientation-deviation using function
$D_{LE\_PG}$.

4.2. Compute the mean pencil/corner junction-deviation using function
$D_{PC\_PG}$.

4.3. Compute the mean parallel/ramp phase-deviation using function
$D_{PR\_PG}$.

4.4. Compute the deviation from a specific regularity using generic
function $D_{SP\_PG}$.

4.5. Compute the saliency value by combining the above results ac- 
cording to equation (2.44). 

5. Bring the specific polygons into order according to decreasing 
saliency values. 
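Steps 4 and 5 of the procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the four deviation values of each polygon are taken as given numbers in [0, 1], the saliency is assumed to be one minus the weighted mean deviation (our reading of equations (2.44) and (2.45)), and the names `polygon_saliency`, `rank_polygons` and the candidate labels are hypothetical.

```python
def polygon_saliency(d_le, d_pc, d_pr, d_sp, weights=(1.0, 1.0, 1.0, 1.0)):
    """Saliency of a specific polygon from its four mean deviation values.

    d_le: line/edge orientation-deviation, d_pc: pencil/corner junction-deviation,
    d_pr: parallel/ramp phase-deviation, d_sp: deviation from the specific regularity.
    """
    w1, w2, w3, w4 = weights
    weight_sum = w1 + w2 + w3 + w4
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    d_pg = w1 * d_le + w2 * d_pc + w3 * d_pr + w4 * d_sp
    return 1.0 - d_pg / weight_sum


def rank_polygons(candidates, weights=(1.0, 1.0, 1.0, 1.0)):
    """Order candidates (label, d_le, d_pc, d_pr, d_sp) by decreasing saliency."""
    scored = [(polygon_saliency(*dev, weights=weights), label)
              for label, *dev in candidates]
    return sorted(scored, reverse=True)


# Hypothetical deviation values for three candidate polygons.
candidates = [
    ("monitor_top", 0.05, 0.10, 0.20, 0.015),
    ("clutter", 0.30, 0.40, 0.50, 0.45),
    ("board", 0.08, 0.06, 0.30, 0.02),
]
# The third weight is set to 0 for shapes where the phase term is not used.
ranking = rank_polygons(candidates, weights=(1.0, 1.0, 0.0, 1.0))
for saliency, label in ranking:
    print(f"{label}: {saliency:.3f}")
```

Setting the third weight to 0 removes the parallel/ramp phase term from both the numerator and the normalizing sum, so the saliency remains in the range [0, 1].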



Experiments on the Extraction of Regular Polygons

This generic procedure has been applied successfully for localizing regular 
polygons which originate from the surfaces of man-made objects. For exam- 
ple, the side and top face of the computer monitor in Figure 2.29 have been 
extracted. They were identified as the most salient approximate
reflection-symmetric and approximate translation-symmetric octagons, respectively. As
a second example, the boundary of the electronic board in Figure 2.30 has
been extracted. It was identified as the most salient approximate right-angled
hexagon.

Further examples are presented in the next section in the framework of 
extracting arrangements of polygons. For example, the complete arrangement 
of polygons for the computer monitor will be determined by extracting and 
slightly adjusting the polygons of the side, top, and front faces under the 
consideration of certain assembly level constraints. 



2.4 Compatibility-Based Assembly Level Grouping 

The extraction of regular polygons can be considered as an intermediate step 
of the higher goal of localizing certain objects and describing their boundaries 
in more detail. Geometric/photometric compatible features have been com- 
bined with geometric regularity features for defining a saliency measure of 
specific polygons in the image. A salient polygon may arise from the bound- 
ary of a single object face or of a whole object silhouette. In general, it is 
assumed that the surface of a man-made 3D object can be subdivided in 
several faces and from these only a subset is supposed to be observable, e.g. 
in the case of a parallelepiped just 3 plane faces are observable from a
non-degenerate view point. The projective transformation of this kind of object
surface should yield an arrangement of several polygons. We introduce two 
assembly level grouping criteria, i.e. the vanishing-point compatibility and 
the pencil compatibility, which impose imperative restrictions on the shape of 
an arrangement of polygons. These constraints are directly correlated to the 
three-dimensional (regular) nature of the object surface. The principles are 
demonstrated for objects of roughly polyhedral shape, i.e. local protrusion, 
local deepening, or round corners are accepted. 



2.4.1 Focusing Image Processing on Polygonal Windows 

The compatibilities and regularities, used so far, just take basic principles of 
image formation and qualitative aspects of the shape of man-made objects 
into account. Although only general assumptions are involved, various exper- 
iments have shown that the extracted polygons are a useful basis for applying 
techniques of detailed object detection and boundary extraction. For exam- 
ple, Figure 2.30 showed an electronic board which has been extracted in a 
cluttered environment as an approximate right-angled hexagon. Subsequent 
image processing can focus on the hexagon image window for detecting
specific electronic components on the board. As another example, Figure 2.22
showed objects of electrical scrap and a set of extracted approximate
parallelograms, among which the rough silhouette boundary of the dummy box was
included (see Figure 2.24). Subsequent image processing can focus on this 
parallelogram image window for extracting a detailed boundary description. 

Polygons for the Approximation of Depicted Object Silhouettes 

This section concentrates on detailed boundary extraction, and in this con- 
text, the previously extracted polygons serve a further purpose. We need to 
examine the type of geometric shape which is bounded by a polygon, in order 
to apply a relevant approach for detailed boundary extraction. The task of 
roughly characterizing the object shape (e.g. polyhedral or curvilinear) can
be solved by taking a further constraint into account. According to the prin- 
ciples underlying the procedure of extracting polygons (in Section 2.3), it is 
reasonable to assume that there are polygons included, which approximately 
describe the silhouette of interesting objects. This has also been confirmed 
by the polygons extracted from the images in Figure 2.24 and Figure 2.30. 

Assumption 2.11 (Silhouette approximation by salient polygons)
For the set of interesting objects, depicted in an image, there are salient
polygons which approximate the silhouettes with a necessary accuracy $\delta_{12}$.
Object Recognition in Polygonal Image Windows

Based on this assumption, it is expected that a large part of the polygon 
closely touches the object silhouette. Therefore, the gray value structure of 
the interior of a polygon belongs mainly to the appearance of one object and 
can be taken into account in various approaches of object or shape classifica- 
tion. For example, in some applications a simple histogram-based approach is 
appropriate, as was inspired by the work of Swain and Ballard [164]. An ob- 
ject is represented by taking several views from it, and computing histograms 
of gray values, edge orientations, corner properties, cooccurrence features, or 
further filter responses. In an offline phase a set of objects with relevant shapes 
is processed and the histograms stored in a database. In the online phase his- 
tograms are computed from the interior of a polygon, and matched with the 
database histograms. Based on the criterion of highest matching score the 
type of object shape must be determined in order to apply the relevant ap- 
proach of boundary extraction, e.g. extraction of arrangements of polygons or 
alternatively curvilinear shapes. For example, from the gray value structure 
in the quadrangle image window in Figure 2.24 it has been concluded that 
the extraction of a detailed arrangement of polygons is reasonable. We have 
mentioned the approach of classification (which uses histograms) just briefly. 
For more complicated applications the simple approach is insufficient and an
advanced approach for classification is needed, e.g. see Matas et al. [107]. A 
detailed treatment is beyond the scope of this chapter. 
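The histogram-based matching can be sketched in a few lines. The sketch uses histogram intersection, the matching score proposed by Swain and Ballard, on gray value histograms only; the bin count, the shape labels, and the tiny pixel lists are hypothetical, and a real database would hold histograms of several views and several feature types per object.

```python
def gray_histogram(gray_values, bins=8, levels=256):
    """Normalized histogram of integer gray values in the range [0, levels)."""
    if not gray_values:
        raise ValueError("no pixels given")
    counts = [0] * bins
    for value in gray_values:
        counts[min(value * bins // levels, bins - 1)] += 1
    return [count / len(gray_values) for count in counts]


def histogram_intersection(query, model):
    """Histogram intersection score, normalized by the model histogram."""
    return sum(min(q, m) for q, m in zip(query, model)) / sum(model)


def classify(query, database):
    """Return the label of the database histogram with the highest matching score."""
    return max(database, key=lambda label: histogram_intersection(query, database[label]))


# Offline phase: hypothetical database with one histogram per shape type.
database = {
    "polyhedral": gray_histogram([10, 20, 200, 210]),
    "curvilinear": gray_histogram([100, 110, 120, 130]),
}
# Online phase: gray values taken from the interior of an extracted polygon.
query = gray_histogram([15, 25, 205, 220])
print(classify(query, database))
```

The returned label would then select the boundary extraction approach, e.g. extraction of an arrangement of polygons for the polyhedral case.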

Windowed Orientation-Selective Hough Transformation

Figure 2.24 serves to demonstrate the principles underlying a generic
procedure for extracting arrangements of polygons. Although the dummy
object is located in a cluttered scene, the silhouette quadrangle acts
as a window, and the procedure of boundary extraction is hardly distracted
by the environment or background of the object. Accordingly, we
introduce the windowed orientation-selective Hough transformation (WOHT) 
which just considers the image in a polygonal window. The definition is quite 
similar to that of OHT (see Definition 2.3), except that the votes are only 
collected from a subset Vs of coordinate tuples, taken from the interior of the 
extracted polygon and extended by a small band at the border. The WOHT
helps to overcome the problem of a confusing profusion of Hough peaks.
For example, we can apply the WOHT to the quadrangle window outlined 
in Figure 2.24, which contains the approximate parallelepiped object. The 
boundary line configuration for three visible faces should consist of nine line 
segments, organized in three sets of three approximate parallel lines each.
In the Hough image of the WOHT, nine peaks must accordingly be organized in
three stripes of three peaks each. This constraint has to be
considered in an approach for searching the relevant Hough peaks. However, 
a configuration like this is just a necessary characteristic but not a sufficient 
one for constructing the relevant object boundary. Further principles and 
compatibilities of projective transformation will be considered for the purpose
of extracting the relevant arrangement of polygons. 
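The windowed voting scheme can be sketched as follows. This is not the original implementation: it assumes the usual normal form $x \cos\phi + y \sin\phi = r$ for image lines, edge pixels given as tuples (x, y, estimated normal angle in degrees), and a hypothetical tolerance `phi_tol_deg` for the orientation selectivity. The small band at the polygon border mentioned above is omitted for brevity.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as a list of vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def windowed_oht(edge_pixels, polygon, phi_tol_deg=5.0, r_step=1.0):
    """Collect Hough votes only from edge pixels inside the polygonal window.

    Each pixel votes only for angles phi (integer degrees in [0, 180)) that are
    close to its estimated edge orientation. Returns a sparse accumulator that
    maps (r_index, phi) to the number of votes.
    """
    accumulator = {}
    for x, y, theta in edge_pixels:
        if not point_in_polygon(x, y, polygon):
            continue  # votes from outside the window are ignored
        for phi in range(180):
            diff = abs(phi - theta) % 180.0
            if min(diff, 180.0 - diff) > phi_tol_deg:
                continue  # orientation-selective voting
            rad = math.radians(phi)
            r = x * math.cos(rad) + y * math.sin(rad)
            key = (round(r / r_step), phi)
            accumulator[key] = accumulator.get(key, 0) + 1
    return accumulator


# Hypothetical data: a vertical edge at x = 5 inside a square window and a
# second vertical edge at x = 20 outside the window.
window = [(0, 0), (10, 0), (10, 10), (0, 10)]
pixels = [(5, y, 0.0) for y in range(1, 10)] + [(20, y, 0.0) for y in range(1, 10)]
votes = windowed_oht(pixels, window, phi_tol_deg=0.0)
print(votes)
```

With a tolerance of zero only the inside edge produces votes; a larger tolerance spreads the votes of each pixel over neighboring angles, as in the unwindowed transformation.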

Looking for Configurations of Hough Peaks 

A basic procedure for extracting configurations of Hough peaks has already 
been mentioned in Subsection 2.3.1. It extracts a certain number of Hough 
peaks and groups them by considering only the line parameter $\phi$. Related to
the characteristic of parallelepiped objects the procedure must yield at least
three clusters each consisting of at least three Hough peaks.
An alternative procedure can be designed which executes the search
more carefully. It looks for the global maximum peak and thus determines
the first relevant horizontal stripe. Within the stripe a certain number of other 
maximum peaks must be localized. Then, the stripe is erased completely and 
in this modified Hough image the next global maximum is looked for. The new 
maximum defines the second relevant stripe in which once again the specified 
number of other maximum peaks are detected. By repeating the procedure a 
certain number of times we obtain the final configuration of Hough peaks. For 
demonstration, this procedure has been applied to the window in Figure 2.24. 
A configuration of nine Hough peaks organized in three stripes of three peaks 
respectively yields the set of image lines in Figure 2.34. 
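The stripe-wise search can be sketched as follows, under simplifying assumptions: the Hough image is a list of rows indexed by the angle parameter with columns indexed by the distance parameter, a stripe is a band of `half_width` rows around the row of the current global maximum, and peaks within a stripe must be at least `min_r_gap` columns apart. These parameter names are hypothetical, and the wrap-around of the angle axis is ignored.

```python
def find_peak_configuration(hough, n_stripes, peaks_per_stripe,
                            half_width=2, min_r_gap=3):
    """Return a list of stripes, each a list of (phi_index, r_index) peaks."""
    work = [row[:] for row in hough]  # copy, because stripes are erased
    n_phi = len(work)
    configuration = []
    for _ in range(n_stripes):
        # The row of the global maximum determines the next relevant stripe.
        best_phi = max(range(n_phi), key=lambda i: max(work[i]))
        lo = max(0, best_phi - half_width)
        hi = min(n_phi, best_phi + half_width + 1)
        cells = sorted(((work[i][j], i, j)
                        for i in range(lo, hi)
                        for j in range(len(work[i]))), reverse=True)
        stripe = []
        for votes, i, j in cells:
            if votes <= 0 or len(stripe) == peaks_per_stripe:
                break
            # Keep only peaks that are clearly separated in the distance parameter.
            if all(abs(j - pj) >= min_r_gap for _, pj in stripe):
                stripe.append((i, j))
        configuration.append(stripe)
        # Erase the stripe completely before searching the next global maximum.
        for i in range(lo, hi):
            work[i] = [0] * len(work[i])
    return configuration


# Hypothetical Hough image with 6 angle rows and 10 distance columns.
hough = [[0] * 10 for _ in range(6)]
hough[1][1], hough[1][5], hough[1][9] = 9, 8, 7
hough[4][0], hough[4][4], hough[4][8] = 6, 5, 4
config = find_peak_configuration(hough, n_stripes=2, peaks_per_stripe=3,
                                 half_width=0, min_r_gap=2)
print(config)
```

Calling the function with four stripes of three peaks would correspond to the configuration of 12 Hough peaks used in the experiments.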




Fig. 2.34. Candidate set of nine boundary lines (for the dummy box) organized 
in three sets of three approximate parallel lines, respectively. Result of applying the 
windowed OHT to the quadrangle image window in Figure 2.24 and selecting nine 
Hough peaks organized in three stripes. 
Although the necessary characteristic of the peak configuration holds, it
is impossible to construct the complete object boundary, because important 
boundary lines are missing. Fortunately, a configuration of 12 Hough peaks 
organized in four stripes (see Figure 2.35) yields a more complete list of 
relevant boundary lines (see Figure 2.36). The next two subsections take 
compatibility principles and assembly level grouping criteria into account for 
evaluating or adjusting image lines for object boundary construction. 




Fig. 2.35. Result of applying the windowed OHT to the quadrangle image window 
in Figure 2.24 and selecting 12 Hough peaks organized in four stripes of three Hough 
peaks, respectively. 




Fig. 2.36. Candidate set of 12 boundary lines (for the dummy box) specified
by the 12 Hough peaks in Figure 2.35. More relevant boundary lines are
included compared to Figure 2.34.
2.4.2 Vanishing-Point Compatibility of Parallel Lines

The projective transformation of parallel boundary lines generates approximate
parallel image lines with the specific constraint that they should meet
in one vanishing point. This vanishing-point compatibility imposes certain
qualitative constraints on the courses of Hough peaks within a horizontal
stripe (specifying approximate parallel lines). Figure 2.37 shows a
projected parallelepiped and two vanishing points $p_{v1}$ and $p_{v2}$ for two sets
$\{\mathcal{L}_{11}, \mathcal{L}_{12}, \mathcal{L}_{13}\}$ and $\{\mathcal{L}_{21}, \mathcal{L}_{22}, \mathcal{L}_{23}\}$ of three approximate parallel line
segments, respectively. Let $(r_{ij}, \phi_{ij})$ be the polar form parameters of the lines.
For the monotonically increasing distance parameter
$r_{11} < r_{12} < r_{13}$ of the first set of lines we observe a monotonically increasing angle
parameter $\phi_{11} < \phi_{12} < \phi_{13}$, and for the monotonically increasing distance
parameter $r_{21} < r_{22} < r_{23}$ of the second set of lines a monotonically decreasing
angle parameter $\phi_{21} > \phi_{22} > \phi_{23}$. This specific observation is generalized
to the following geometric compatibility.




Fig. 2.37. Projected parallelepiped and two vanishing points $p_{v1}$ and $p_{v2}$.
Monotonically increasing angle parameter $\phi_{11} < \phi_{12} < \phi_{13}$, and monotonically
decreasing angle parameter $\phi_{21} > \phi_{22} > \phi_{23}$ for two sets of three approximate
parallel line segments, respectively.
Assumption 2.12 (Vanishing-point compatibility) Let $\{\mathcal{L}_1, \dots, \mathcal{L}_V\}$
be a set of approximate parallel line segments in the image, which originate
from projective transformation of parallel line segments of the 3D object
boundary. The extended lines related to the image line segments meet
at a common vanishing point $p_v$ and can be ordered according to the strong
monotony $r_1 < \dots < r_i < \dots < r_V$ of the distance parameter. For this
arrangement there is a weak monotony of the angle parameter,

$\phi_1 \geq \dots \geq \phi_i \geq \dots \geq \phi_V$ or $\phi_1 \leq \dots \leq \phi_i \leq \dots \leq \phi_V$ (2.46)

Special Cases of the Vanishing-Point Compatibility 

We have to be careful with approximate vertical lines whose angle parameter
$\phi$ is near to 0° or near to 180°. In a cluster of Hough peaks with that
characterization all lines with $\phi$ near to 0° will be redefined by $\tilde{r} := -r$
and $\tilde{\phi} := \phi + 180^\circ$. This is permitted, because the equations $L(p, (r, \phi)) = 0$
and $L(p, (\tilde{r}, \tilde{\phi})) = 0$ define the same line (which is easily proven based on
the definition of function L in equation (2.1)). Under this consideration
Assumption 2.12 must hold for any set of approximate parallel lines meeting
at a common point. Consequently, the course of Hough peaks in a horizontal
stripe must increase or decrease weakly monotonically.
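The monotony test, including the special treatment of approximate vertical lines, can be sketched as follows. The sketch assumes that lines are given as pairs (r, angle in degrees) with angles in [0°, 180°) and that equation (2.1) is the usual normal form $x \cos\phi + y \sin\phi - r = 0$; the threshold `near_deg`, which decides when an angle counts as "near to 0°", and the example values are hypothetical.

```python
def normalize_vertical_cluster(lines, near_deg=20.0):
    """Redefine lines with phi near 0 degrees by r := -r and phi := phi + 180.

    Both parameter pairs describe the same line, so the cluster of approximate
    vertical lines becomes contiguous in the angle parameter.
    """
    return [(-r, phi + 180.0) if phi < near_deg else (r, phi) for r, phi in lines]


def is_weakly_monotonic(lines):
    """Check Assumption 2.12: order by distance r, then test the angle sequence."""
    phis = [phi for _, phi in sorted(lines)]
    pairs = list(zip(phis, phis[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Hypothetical stripe of three approximate vertical lines near x = 10, 20, 30
# whose orientation rotates steadily across the 0/180 degree boundary.
stripe = [(-10.0, 179.0), (20.0, 0.5), (30.0, 2.0)]
print(is_weakly_monotonic(stripe))                              # raw parameters
print(is_weakly_monotonic(normalize_vertical_cluster(stripe)))  # after redefinition
```

The raw parameters violate the monotony only because of the discontinuity of the angle range; after the redefinition the three peaks form a weakly monotonic course.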

Experiments on the Vanishing-Point Compatibility

For demonstration, this vanishing-point compatibility will be examined in the 
Hough image of clustered peaks in Figure 2.35. Assumption 2.12 holds for 
the third and fourth stripe but not for the first and second stripe. Actually, 
the Hough peaks in the first stripe specify lines which are candidates for the 
short boundary lines of the object (approximate vertical lines in Figure 2.36). 
The problem arises for the middle line due to small gray value contrast be- 
tween neighboring faces. The Hough peaks of the second stripe originate from 
neighboring objects at the border of the quadrangle image window. 

Strategy for Applying the Vanishing-Point Compatibility 

The vanishing-point compatibility is useful for slightly modifying the
parameters r and $\phi$ of extracted image lines. A simple procedure is applied, which
assumes that in a set of approximate parallel lines at least two lines are reli- 
able and need not be adjusted. Candidates for this pair of seed lines are outer 
silhouette lines, which can be extracted robustly in case of high contrast be- 
tween the gray values of object and background. Otherwise, two inner bound- 
ary lines of the silhouette could serve as seed lines as well, e.g. boundary lines 
of object faces in case of high gray value contrast due to lighting conditions 
or different face colors. The reliability of a line is computed on the basis of 
line/edge orientation-deviation in Definition 2.4. However, thus far the lines 
in Figure 2.36 are not restricted to the relevant line segments of the object 
border. Therefore, we specify for each candidate line (e.g. in Figure 2.36) a
virtual line segment of the same orientation. For the
virtual segments a unique length is specified, which is assumed to be a lower 
bound of the lengths of all relevant boundary line segments in the image. 
Each virtual line segment will be moved in discrete steps along the affiliated 
candidate line from border to border of the polygonal window. Step by step 
the orientation compatibility is evaluated by applying equation (2.12). The 
minimum is taken as the reliability value of the line. 

The most reliable two lines are selected as seed lines and their point 
of intersection computed, which is taken as the vanishing point. Next, the 
other approximate parallel lines (which are less reliable) are redefined such 
that they intersect at the vanishing point. Finally, the redefined lines are 
slightly rotated around the vanishing point in order to optimize the reliability
value. In accordance with this, the weak monotony constraint must hold
in the course of Hough peaks of all approximate parallel lines. Exception
handling is necessary if the two seed lines are exactly parallel, because there
is no finite vanishing point. In this case the unique orientation from the seed
lines is adopted for the less reliable lines and a slight translation is carried 
out (if necessary) to optimize their reliability values. In order to take the 
geometric/photometric compatibility into account the seed lines and/or the 
redefined lines are only accepted if the line/edge orientation compatibility 
holds (see Assumption 2.2), otherwise they are discarded. 
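The geometric core of this strategy can be sketched as follows. The sketch covers only the redefinition of a less reliable line; the reliability evaluation and the final fine rotation are omitted because they require image data. It assumes lines in normal form $x \cos\phi + y \sin\phi = r$ with angles in degrees. How a line is forced through the vanishing point is not specified in the text; here the line is pivoted about a hypothetical anchor point, namely the point of the original line closest to a given hint such as the window center.

```python
import math


def unit_normal(phi_deg):
    rad = math.radians(phi_deg)
    return math.cos(rad), math.sin(rad)


def signed_distance(line, point):
    """Signed distance of a point from the line (r, phi)."""
    nx, ny = unit_normal(line[1])
    return point[0] * nx + point[1] * ny - line[0]


def intersect(line_a, line_b, eps=1e-12):
    """Intersection point of two lines, or None if they are parallel."""
    (a1, b1), (a2, b2) = unit_normal(line_a[1]), unit_normal(line_b[1])
    det = a1 * b2 - a2 * b1
    if abs(det) < eps:
        return None
    x = (line_a[0] * b2 - line_b[0] * b1) / det
    y = (a1 * line_b[0] - a2 * line_a[0]) / det
    return x, y


def line_through(p, q):
    """Line (r, phi) through two distinct points, with phi in [0, 180)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("points must be distinct")
    nx, ny = -dy / norm, dx / norm
    r = p[0] * nx + p[1] * ny
    phi = math.degrees(math.atan2(ny, nx))
    if phi < 0.0:
        phi, r = phi + 180.0, -r
    elif phi >= 180.0:
        phi, r = phi - 180.0, -r
    return r, phi


def adjust_to_vanishing_point(seed_a, seed_b, other, hint):
    """Redefine `other` so that it meets the seed lines at their vanishing point."""
    nx, ny = unit_normal(other[1])
    dist = signed_distance(other, hint)
    anchor = (hint[0] - dist * nx, hint[1] - dist * ny)  # foot point on `other`
    vanishing_point = intersect(seed_a, seed_b)
    if vanishing_point is None:
        # Exactly parallel seed lines: adopt their orientation, keep the anchor.
        sx, sy = unit_normal(seed_a[1])
        return anchor[0] * sx + anchor[1] * sy, seed_a[1]
    return line_through(vanishing_point, anchor)


# Hypothetical seed lines meeting at (100, 0) and a slightly deviating third line.
seed_a = (0.0, 90.0)
seed_b = line_through((100.0, 0.0), (0.0, 10.0))
other = line_through((0.0, 5.0), (100.0, 1.0))
adjusted = adjust_to_vanishing_point(seed_a, seed_b, other, hint=(0.0, 5.0))
print(intersect(seed_a, seed_b), adjusted)
```

The choice of the anchor point determines how far the line is rotated; a subsequent optimization of the reliability value, as described above, would refine this initial redefinition.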

For example, this procedure can be applied to the set of three approx- 
imate vertical lines in Figure 2.36, represented by three non-monotonous 
Hough peaks in the first stripe of Figure 2.35. As a result, the two outer lines 
are determined as seed lines, and the inner line is slightly rotated to fulfill
the vanishing-point compatibility. The next subsection introduces a further 
compatibility inherent in the projection of polyhedral objects, which will be 
applied in combination with the vanishing-point compatibility later on. 

2.4.3 Pencil Compatibility of Meeting Boundary Lines 

In man-made objects, the most prominent type of junction is a pencil of
three lines (3-junction), e.g. a parallelepiped includes eight
3-junctions. Under projective transformation, some parts of an opaque
object boundary will be occluded, which makes certain junctions just partly
visible or even invisible. For example, under general view conditions we real- 
ize in the image of a parallelepiped four 3-junctions and three 2-junctions (see 
Figure 2.37). That is, in four junctions all three converging lines are visible, 
in three junctions only two lines are visible, respectively, and one junction is 
completely occluded. A thorough analysis of visibility aspects of polyhedral 
objects was presented by Waltz for the purpose of interpreting line draw- 
ings [177, pp. 249-281]. We introduce a geometric compatibility related to 
3-junctions for which three converging lines are visible, respectively. 
Assumption 2.13 (Pencil compatibility) Let a 3D pencil point be defined
by the intersection of three meeting border lines of an approximate polyhedral 
3D object. The projective transformation of the 3D pencil point should yield 
just one 2D pencil point in the image. 

For illustration, we select from the image in Figure 2.36 a subset of three 
boundary lines and compute the intersection points as shown in Figure 2.38. 
Obviously, Assumption 2.13 does not yet hold because just one common in- 
tersection point is expected instead of three. The reason is that line extraction 
via Hough transformation is more or less inaccurate (like any other approach 
to line extraction). Actually, correctness and accuracy of lines can only be 
evaluated with regard to the higher goal of extracting the whole object bound- 
ary. The previously introduced vanishing-point compatibility provided a first 
opportunity of including high level goals to line extraction, and the pencil 
compatibility is a second one. 




Fig. 2.38. Subset of three boundary lines taken from Figure 2.36 and three differ- 
ent intersection points. One unique intersection point is requested in order to fulfill 
the pencil compatibility. 



Strategy for Applying the Pencil Compatibility 

In order to make Assumption 2.13 valid, we apply a simple procedure, which 
only adjusts the position parameter r of image lines. The idea is to select 
from a 3-junction the most reliable two lines (using the procedure mentioned 
above), compute the intersection point, and translate the third line into this 
point. The approach proved to be reasonable, because we frequently
observed that two lines of a 3-junction are acceptably accurate and
only the third line deviates to a larger extent. For example,
the most reliable two lines in Figure 2.38 are the slanted ones. Therefore,
their intersection point is computed and the approximate vertical line is
translated in parallel into this point. More sophisticated procedures are
conceivable, which flexibly fine-tune the parameters of several relevant lines in
combination (not treated in this work). 
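The parallel translation of the third line amounts to recomputing its distance parameter at the pencil point. A minimal sketch, assuming the normal form $x \cos\phi + y \sin\phi = r$ with angles in degrees and assuming that the two most reliable lines have already been selected; the function names and line values are hypothetical:

```python
import math


def intersect(line_a, line_b, eps=1e-12):
    """Intersection point of two lines given as (r, phi in degrees), or None."""
    a1, b1 = math.cos(math.radians(line_a[1])), math.sin(math.radians(line_a[1]))
    a2, b2 = math.cos(math.radians(line_b[1])), math.sin(math.radians(line_b[1]))
    det = a1 * b2 - a2 * b1
    if abs(det) < eps:
        return None
    x = (line_a[0] * b2 - line_b[0] * b1) / det
    y = (a1 * line_b[0] - a2 * line_a[0]) / det
    return x, y


def enforce_pencil(reliable_a, reliable_b, third):
    """Translate `third` in parallel so that it passes through the pencil point.

    Only the distance parameter r of the third line changes; its angle is kept.
    """
    point = intersect(reliable_a, reliable_b)
    if point is None:
        raise ValueError("the two reliable lines are parallel")
    rad = math.radians(third[1])
    new_r = point[0] * math.cos(rad) + point[1] * math.sin(rad)
    return point, (new_r, third[1])


# Hypothetical 3-junction: two reliable lines x = 2 and y = 3, and a third
# line at 45 degrees that misses their intersection point.
pencil_point, adjusted = enforce_pencil((2.0, 0.0), (3.0, 90.0), (4.0, 45.0))
print(pencil_point, adjusted)
```

Since the angle parameter is untouched, this adjustment does not disturb a vanishing-point constraint that has been enforced beforehand on exactly parallel lines, but for converging lines the two adjustments interact and may have to be iterated.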



2.4.4 Boundary Extraction for Approximate Polyhedra 

The geometric/photometric compatibility constraints and the geometric grou- 
ping criteria at the primitive, structural, and assembly level can be combined 
in a generic procedure for extracting the arrangement of polygons for a poly- 
hedral object boundary. A precondition for the success of this procedure is 
that all relevant line segments, which are included in the arrangement of 
polygons, can be detected as peaks in the Hough image. 

Assumption 2.14 (High gray value contrast between object faces)
All transitions between neighboring faces of a polyhedral object are
characterized by high gray value contrast of at least $\delta_{13}$.

Generic Procedure PE 3 for Extracting Arrangements of Polygons



Procedure PE 3 



1. Apply the windowed OHT in a polygonal image window, detect a 
certain number of Hough peaks, and consider that they must be 
organized in stripes. 

2. For each stripe of Hough peaks, examine the vanishing-point com- 
patibility, and if it does not hold, then apply the procedure men- 
tioned previously. 

3. Compute intersection points for those pairs of image lines which 
are specified by pairs of Hough peaks located in different stripes. 

4. Determine all groups of three intersection points in small neighbor- 
hoods. For each group examine the pencil compatibility, and if it 
doesn’t hold, then apply the procedure mentioned previously. 

5. Based on the redefined lines, determine a certain number of most 
salient polygons (see Definition 2.15) by applying a procedure sim- 
ilar to the one presented in Section 2.3. 

6. Group the polygons into arrangements, compute an assembly value 
for each arrangement, and based on this, select the most relevant 
arrangement of polygons. 
The pre-specified number of peaks to be extracted in the first step of the
procedure must be high enough such that all relevant peaks are included. Another
critical parameter is involved in the fifth step, i.e. extracting a certain
number of most salient polygons. We must be careful that all relevant poly- 
gons are included which are needed for constructing the complete arrange- 
ment of polygons of the object boundary. The final step of the procedure will 
be implemented dependent on specific requirements and applications. For 
example, the grouping of polygons can be restricted to arrangements which 
consist of connected, non-overlapping polygons of a certain number, e.g. ar- 
rangements of 3 polygons for describing three visible faces of an approximate 
parallelepiped. The assembly value of an arrangement of polygons can be 
defined as the mean saliency value of all included polygons. 
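The final step, in the specific variant just described, can be sketched as follows. The sketch assumes that each polygon is given as a tuple of vertex identifiers together with its saliency value, and that two polygons count as connected if they share a boundary edge. The test for non-overlapping polygons is omitted, and all names and values are hypothetical.

```python
from itertools import combinations


def polygon_edges(vertex_ids):
    """Set of undirected edges of a polygon given by its vertex identifiers."""
    n = len(vertex_ids)
    return {frozenset((vertex_ids[i], vertex_ids[(i + 1) % n])) for i in range(n)}


def is_connected(group):
    """True if all polygons of the group are linked via shared edges."""
    edge_sets = [polygon_edges(vertices) for vertices, _ in group]
    reached, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(group)):
            if j not in reached and edge_sets[i] & edge_sets[j]:
                reached.add(j)
                frontier.append(j)
    return len(reached) == len(group)


def best_arrangement(polygons, size=3):
    """Select the connected arrangement of `size` polygons with the highest
    assembly value, defined as the mean saliency of the included polygons."""
    best, best_value = None, -1.0
    for group in combinations(polygons, size):
        if not is_connected(group):
            continue
        value = sum(saliency for _, saliency in group) / size
        if value > best_value:
            best, best_value = group, value
    return best, best_value


# Hypothetical candidates: three faces of a box meeting at vertex 0 and one
# salient but unrelated quadrangle.
polygons = [
    ((0, 1, 2, 3), 0.9),
    ((0, 3, 4, 5), 0.8),
    ((0, 5, 6, 1), 0.7),
    ((7, 8, 9, 10), 0.95),
]
arrangement, assembly_value = best_arrangement(polygons, size=3)
print(assembly_value, [vertices for vertices, _ in arrangement])
```

The unrelated quadrangle is rejected despite its high saliency, because every arrangement containing it is disconnected.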

Experiments on the Extraction of Polygon Arrangements

With this specific implementation the generic procedure has been applied to 
the quadrangle image window in Figure 2.24. The extracted boundary for the 
object of approximate parallelepiped shape is shown in Figure 2.39. Further- 
more, the procedure has been applied to more general octagons, e.g. a loud- 
speaker, whose surface consists of rectangles and trapezoids (see Figure 2.40). 
The generic procedure also succeeds for more complicated shapes such as the 
computer monitor, which has already been treated in the previous section 
(see Figure 2.29). The extracted boundary in Figure 2.41 demonstrates the 
usefulness of the pencil compatibility. There are four 3-junctions with unique 
pencil points, respectively (as opposed to non-unique points in Figure 2.29). 

Although the procedure for boundary extraction yields impressive results
for more or less complicated objects, it may fail in simple situations.
This is due to the critical assumption that all line segments of the boundary 
must be detected explicitly as peaks in the Hough image. For images of 
objects with nearly homogeneous surface color, such as the dummy box in 
Figure 2.39, the contrast between faces is just based on lighting conditions, 
which is an unreasonable basis for boundary extraction. On the other hand, 
for objects with surface texture or inscription spurious gray value edges may 
exist, which are as distinctive as certain relevant edges at the border of the 
object silhouette. Every linear edge sequence produces a Hough peak.
Consequently, a large number of Hough peaks may have to be
extracted so that all relevant boundary lines are included.



2.4.5 Geometric Reasoning for Boundary Extraction 

Fig. 2.39. Dummy box with approximate right-angled parallelepiped shape in a
complex environment. Arrangement of polygons describing the visible boundary.

Fig. 2.40. Loudspeaker with approximate octagonal shape in a complex
environment. Arrangement of polygons describing the visible boundary.

Fig. 2.41. Computer monitor with approximate polyhedral shape including
non-convexities. Arrangement of polygons describing the visible boundary.

This section presents a modified procedure for boundary extraction which
applies a sophisticated strategy of geometric reasoning. It is more general
in the sense that the critical Assumption 2.14, involved in the procedure
presented above, is weakened. However, boundary extraction is restricted to
objects of approximate parallelepiped shape and therefore the procedure is
more specific concerning the object shape. The usability of the procedure is
based on the following assumption.



Assumption 2.15 (Parallelepiped approximation) The reasonable type 
of shape approximation for the object in a quadrangle image window is the 
parallelepiped. 

Generic Procedure PE 4 for Boundary Extraction of Parallelepipeds 



Procedure PE 4 



1. Determine a quadrangle image window which contains an object of 
approximate parallelepiped shape. 

2. Determine just the boundary of the object silhouette which is as- 
sumed to be the most salient hexagon. 

3. Propagate the silhouette lines (outer boundary lines) to the in- 
terior of the silhouette to extract the inner lines. Apply the geo- 
metric/photometric compatibility criteria and the assembly level 
grouping criteria to extract the most relevant arrangement of poly- 
gons. 
Experiments on the Boundary Extraction of Parallelepipeds

Figure 2.42 shows the quadrangle image window containing the relevant tar- 
get object, i.e. a transceiver box of approximate parallelepiped shape. The 
boundary line segments of the parallelepiped silhouette must form a hexagon 
(see Figure 2.43). A saliency measure is defined for hexagons, which takes 
into account the structural level grouping criterion of reflection-symmetry 
and the aspect that the hexagon must touch a large part of the quadrangle 
contour. This yields the boundary line segments in Figure 2.43, which are or- 
ganized as three pairs of two approximate parallel line segments, respectively. 
Additionally, three inner line segments of the silhouette are needed to build 
the arrangement of polygons for the boundary of the parallelepiped. The 
vanishing-point compatibility is taken into account to propagate the approx- 
imate parallelism of outer lines to the interior of the silhouette. Furthermore, 
the pencil compatibility constrains inner lines to go through the pencil points 
of the silhouette boundary lines and additionally to intersect in the interior 
of the silhouette at just one unique point. The final arrangement of polygons 
must consist of just four 3-junctions and three 2-junctions. The combined 
use of the assembly level criteria guarantees that only two configurations of 
three inner lines are possible (one configuration is shown in Figure 2.44). 
The relevant set of three inner line segments is determined based on the best 
geometric/photometric compatibility. Figure 2.45 shows the final boundary 
line configuration for the transceiver box. 

Further examples of relevant object boundaries are given below (see Fig- 
ure 2.46 and Figure 2.47). They have been extracted from usual images of 
electrical scrap using the procedure just introduced. 




Fig. 2.42. Transceiver box with approximate right-angled parallelepiped shape.
The black quadrangle surrounding the object indicates the image window for
detailed processing.

Fig. 2.43. Extracted regular hexagon, which describes the approximate silhouette
of the transceiver box.




Fig. 2.44. Relevant set of three inner lines of the silhouette of the transceiver box. 
They have been determined by propagation from outer lines using assembly level 
grouping criteria and the geometric/photometric compatibility. 




Fig. 2.45. Transceiver box with final polygon arrangement for the parallelepiped
boundary description.

Fig. 2.46. Radio with approximate right-angled, parallelepiped shape and
extracted arrangement of polygons of the boundary.




Fig. 2.47. Chip-carrier with approximate right-angled, parallelepiped shape and
extracted arrangement of polygons of the boundary.

2.5 Visual Demonstrations for Learning Degrees of
Compatibility

It is required to provide a justification for the applied compatibilities. The 
degrees of compatibility must be learned in the actual environment for the 
actual task, which is done on the basis of visual demonstration. In this sec- 
tion we focus on two types of geometric/photometric compatibilities, i.e. 
line/edge orientation compatibility and parallel/ramp phase compatibility. 
Furthermore, one type of geometric compatibility under projective transfor- 
mation is treated, i.e. the parallelism compatibility. 



2.5.1 Learning Degree of Line/Edge Orientation Compatibility 

The applicability and the success of several approaches depend on accurate 
estimations of edge orientation. The orientations of gray value edges have 
been determined by applying to the image a set of four differently oriented 
2D Gabor functions. Gabor parameters are the eccentricity values
$\{\sigma_1, \sigma_2\}$ of the enveloping Gaussian and the center frequencies
$\{u_1, u_2\}$ of the complex wave. The specific values of the Gabor parameters
influence the estimation of edge orientations. The accuracy of edge orientation
must be considered in several assumptions presented in Sections 2.2 and 2.3,
i.e. in geometric/photometric compatibility principles and in
compatibility-based structural level grouping criteria. More concretely, the
accuracy of edge orientation plays a role in the threshold parameters
$\delta_1, \delta_2, \delta_3, \delta_6, \delta_7, \delta_8$. Accordingly,
values for these threshold parameters must be determined on the basis of the
accuracy of estimated edge orientations.

The purpose of visual demonstration is to find for the Gabor function an 
optimal combination of parameter values which maximizes the accuracy of 
orientation estimation. To explain just the principle, we systematically
vary only one Gabor parameter, i.e. the radial component of the center
frequency vector ($c_r := \arctan(u_1, u_2)$), and keep the other parameters
fixed. A statistical approach is used which considers patterns of different
orientations. For each rotated version the edge orientations are estimated 
under different Gabor parametrizations. The variance of estimated edge ori- 
entations relative to the required edge orientations gives a measure for the 
accuracy of estimation. 

Experiments on the Accuracy of Estimated Edge Orientations
(Simulations)

As an example, we define 20 operator versions for orientation estimation
(according to equations (2.6), (2.7), (2.8), (2.9), and (2.10)) by
systematically varying the radial center frequency $c_r$ of the involved Gabor
functions. For example, Figure 2.48 shows the real part of the impulse
responses of the Gabor functions of two operators, i.e. operator versions 5
and 13.

Fig. 2.48. Impulse response (real part) of two Gabor functions with different
radial center frequencies.



A series of 15 simulated images of 128 x 128 pixels is created which consist 
of a black square and a gray environment, respectively. The middle point of 
a side of the square is the image center point which serves as turning point 
for the square (see white dot). The square has been rotated in discrete steps 
of 5° from 10° to 80°. Figure 2.49 shows a subset of four images at rotation 
angles 10°, 30°, 60°, 80°. 
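
The construction of such a simulated series can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `rotated_square_image`, the side length of 40 pixels, the gray values 0 and 128, and the direction in which the rotation angle is measured (image rows pointing downward) are hypothetical choices and are not taken from the experiments described here.

```python
import numpy as np

def rotated_square_image(angle_deg, size=128, side=40, black=0, gray=128):
    """Gray image with a black square; the midpoint of one side is the image center."""
    c = size // 2
    rows, cols = np.mgrid[0:size, 0:size]
    x = cols - c   # horizontal offset from the turning point
    y = rows - c   # vertical offset from the turning point
    t = np.deg2rad(angle_deg)
    # Coordinates in the frame of the square: u points into the square,
    # v runs along the side that contains the turning point.
    u = x * np.cos(t) + y * np.sin(t)
    v = -x * np.sin(t) + y * np.cos(t)
    inside = (u >= 0) & (u <= side) & (np.abs(v) <= side / 2)
    image = np.full((size, size), gray, dtype=np.uint8)
    image[inside] = black
    return image

# Rotation in discrete steps of 5 degrees from 10 to 80 degrees (15 images).
angles = np.arange(10, 81, 5)
series = [rotated_square_image(a) for a in angles]
```

The edge passing through the image center then has a known orientation for every image, which is what makes the series usable as ground truth for the operators.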




All 20 operators are applied at the center point of the 15 images in order to 
estimate the orientation of the edge. The accuracy of edge orientation depends 
on the radial center frequency. For example, Figure 2.50 shows the estimations
of edge orientation under varying radial center frequency for the black square
rotated by 30° and by 60°. According to the two examples, the
best estimation for the edge orientation is reached between frequency indices 
4 and 10. 

In order to validate this hypothesis we take all 15 images into account.
The relevant angle of square rotation is subtracted from the respective course
of orientation estimations, which results in modified courses around value 0.
Consequently, the estimation errors can be collected for all 15 images.

Fig. 2.50. Estimation of edge orientation at image center point under varying
radial center frequency; (Left) Diagram for black square rotated by 30°; (Right)
Diagram for black square rotated by 60°.



For each of the 20 operators a histogram of deviations from 0 is determined.
For example, Figure 2.51 shows two histograms for operator versions 5 and 13.




Fig. 2.51. Histograms of estimation errors for edge orientation; (Left) Diagram 
for operator version 5; (Right) Diagram for operator version 13. 



The histogram of operator version 13 is more wide-spread than that of 
version 5. This means that the use of operator version 5 is favourable, be- 
cause the probability for large errors in orientation estimation is low. For 
each of the 20 operators the variance relative to expected value 0 has been 
determined. Figure 2.52 shows on the left the course of variances, and on the
right a section thereof (interval between frequency indices 4 and 8) in higher
variance resolution. Obviously, the minimum variance of approximately 1.7 is 
reached at index 5, and therefore operator version 5 gives the most accurate 
estimations of edge orientations. 
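
The selection step described above can be sketched in a few lines. The function name `select_operator` and the array layout are hypothetical; the sketch assumes that the errors are small enough that no wrap-around of angles has to be handled, and it computes the variance relative to the expected value 0 rather than relative to the sample mean.

```python
import numpy as np

def select_operator(estimates_deg, true_angles_deg):
    """estimates_deg[k, i] is the orientation estimated by operator k on image i."""
    errors = (np.asarray(estimates_deg, dtype=float)
              - np.asarray(true_angles_deg, dtype=float))
    # Variance relative to the expected value 0 (mean squared deviation).
    variances = np.mean(errors ** 2, axis=1)
    return int(np.argmin(variances)), variances

# Toy data (not measured values): three operators, two rotation angles.
best, variances = select_operator(
    [[32.0, 58.0], [31.0, 60.0], [25.0, 66.0]],
    [30.0, 60.0],
)
```

In this toy example the second operator (index 1) has the smallest variance and would be selected.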




Fig. 2.52. (Left) Variance of estimation errors of edge orientation for 20 operators;
(Right) Higher variance resolution of relevant section.



Experiments on the Accuracy of Estimated Edge Orientations
(Real Images)

So far, the experiments have been executed for simulated image patterns.
For certain parametrizations of the Gabor functions, these experiments make
explicit the theoretical accuracy of estimating edge orientation. However,
in real applications one or more cameras are responsible for the complex
process of image formation, and consequently the practical accuracy must
be determined. In order to obtain useful results, we have to perform
realistic experiments, as exemplified by the following figures.

All experiments from above are repeated with real images consisting of 
a voltage controller (taken from electrical scrap). Once again, a series of 15 
images of 128 x 128 pixels is created showing the voltage controller under 
rotation in discrete steps of 5° from 100° to 170°. Figure 2.53 shows a subset 
of four images with the rotation angles 100°, 120°, 150°, 170°, and a white 
dot which indicates the image position for estimating edge orientation. 

Figure 2.54 shows the estimations of edge orientation under varying radial
center frequency for the voltage controller rotated by 120° and by 150°. For
the two examples the best estimation of edge orientation is reached between
frequency indices 4 and 10, and between 0 and 7, respectively.

Fig. 2.53. Voltage controller in four orientations from a series of 15 orientations.

Fig. 2.54. Estimation of edge orientation at a certain point in the image of the
voltage controller under varying radial center frequency (horizontal axis:
frequency index); (Left) Diagram for rotation by 120°; (Right) Diagram for
rotation by 150°.

The estimation errors must be collected from all 15 example orientations.
For example, Figure 2.55 shows two histograms of estimation errors arising
from operator versions 5 and 13, respectively. From this realistic experiment
we observe that the variance of estimation errors is larger than that of the
simulated situation, which was depicted in Figure 2.51.

For each of the 20 operators the variance of orientation deviation has 
been determined. Figure 2.56 shows the course of variances on the left, and a 
section thereof (interval between frequency indices 4 and 6) in higher variance 
resolution on the right. Obviously, a minimum variance of approximately 20 
is reached at indices 5 and 6, and the relevant operator versions will yield the 
most accurate estimations of edge orientations. 

Determining Values for Threshold Parameters 

The large difference between the minimum variances in the realistic case 
(value 20) and the simulated case (value 1.7, see Figure 2.52) motivates the 
necessity of executing experiments under actual imaging conditions. We can 
take the realistic variance value 20 to specify threshold parameters (men- 
tioned in the beginning of this section and introduced in previous sections). 
For example, in our experiments it has proven useful to take for threshold
$\delta_1$ the approximate square root of the variance value, i.e. value 5.

Fig. 2.55. Histograms of estimation errors for edge orientation; (Left) Diagram
for operator version 5; (Right) Diagram for operator version 13.




Fig. 2.56. Variance of estimation errors of edge orientation for 20 operators. 



Instead of requiring that edge orientation must be computed perfectly, we 
made experiments in order to learn realistic estimations. These are used to 
determine a degree of compatibility between the orientation of a line and the 
orientations of edges along the line. 



2.5.2 Learning Degree of Parallel/Ramp Phase Compatibility 

In Section 2.3, the characteristic of the local phase has been exploited to sup- 
port the grouping of those approximate parallel line segments which belong 
to an object silhouette. The criterion is based on a theoretical invariance, i.e.
that the local phase computed at an edge point in the direction of the gradient
angle is constant when rotating the image pattern around this edge point.
However, even for simulated images the gradient angle cannot be determined
exactly (see previous experiments for learning the degree of line/edge orien- 
tation compatibility). Therefore, certain variations of the local phases must 
be accepted. 

Experiments on the Accuracy of Local Phase Computation

In the first experiment, let us consider the series of 15 simulated images
which consist of a black square touching and rotating around the image center
(see Figure 2.49 for a subset of four images). The local phase computation
at the image center in the direction of the estimated gradient angle yields
a distribution as shown in the left diagram of Figure 2.57. The sharp peak
close to value $\pi/2$ indicates a ramp edge at the image center.




Fig. 2.57. Distribution of local phases computed at a certain point in the depiction 
of an object; (Left) Diagram arises for the simulated image of black square; (Right) 
Diagram arises for the real image of a voltage controller. 



In the second experiment, we applied local phase computation to the series
of real images showing the rotating voltage controller (see Figure 2.53
for a subset of four images). For this realistic situation a completely different
distribution of local phases is obtained (see right diagram of Figure 2.57) in
comparison with the simulated situation (see left diagram of Figure 2.57).
The large variance around the ideal value $\pi/2$ arises from the sensitivity of
local phase computation to the selection of the image position and the
adjustment of the center frequency. Based on visual demonstration, we extract
in the following experiments useful data for parametrizing a more robust
approach of local phase computation.

Influence of Image Position and Gabor Center Frequency

We clarify the sensitive influence of image position and Gabor center
frequency on the local phase estimation, using a ramp edge as an example. At
the center position of the ramp the local phase should be estimated as
$\pi/2$ or $-\pi/2$, depending on whether the ramp is ascending or
descending. Figure 2.58 shows on the left an image of the voltage
controller together with an overlay of a horizontal white line segment which
crosses the boundary of the object. In the diagram on the right, the gray
value structure along the virtual line segment is shown, which is a ramp
descending from left to right. Furthermore, the center position of the ramp
edge is indicated, which marks a certain point on the object boundary.




Fig. 2.58. (Left) Example image of the voltage controller with virtual line segment; 
(Right) Gray value structure along the virtual line segment and position of the ramp 
edge. 



To simplify the presentation it is convenient to project the phase
computation onto the first quadrant of the unit circle, i.e. onto the circular
arc up to length $\pi/2$. Along the virtual line segment in the left image of
Figure 2.58 just one ramp edge is crossed. Consequently, the course of
projected local phases along the virtual line segment should be a unimodal
function with the maximum value $\pi/2$ at the center position of the ramp
edge. Actually, we computed this modified local phase for each discrete point
on the virtual line segment and repeated the procedure for different Gabor
center frequencies. Considering the assumptions, it is reasonable to take the
position of the maximum of the local phase along the virtual line segment as
the location of the edge. Figure 2.59 shows four courses of local phases for
four different center frequencies. There is a variation both in the position
and in the value of the maximum when changing the Gabor center frequency. In
the diagrams the desired position and desired local phase are indicated by a
vertical and horizontal line, respectively. The higher the center frequency,
the less appropriate is the operator for detecting a ramp edge. This becomes
obvious by comparing the one distinguished global maximum produced by operator
version 5 (left diagram, bold course) with the collection of local maxima
produced by operator version 14 (right diagram, dotted course). However, it
happens that the value of the global maximum from version 14 is nearer to the
desired value $\pi/2$ than the value arising from operator version 5.
Furthermore, the maximum values for the local phases can differ to a certain
extent when applying operators with similar center frequencies, e.g. comparing
results from operator versions 5 and 7 (left diagram, bold and dotted course).
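
The principle of locating a ramp edge at the maximum of the projected local phase can be illustrated with a one-dimensional sketch. Everything specific in it is an assumption made for illustration: the DC-free 1D Gabor kernel `gabor_1d`, the projection onto the first quadrant via `arctan2(|Im|, |Re|)`, the amplitude mask of 10 percent, the synthetic ramp profile, and the parameter values. The operators of this chapter are 2D and defined by equations (2.6) to (2.10), which are not reproduced here.

```python
import numpy as np

def gabor_1d(sigma, u, half_width):
    """Complex 1D Gabor kernel whose even (real) part is made DC-free."""
    x = np.arange(-half_width, half_width + 1)
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    kernel = envelope * np.exp(1j * u * x)
    # Subtract a scaled envelope so that the real part sums to zero.
    return kernel - envelope * (kernel.real.sum() / envelope.sum())

def projected_local_phase(profile, kernel, min_rel_amplitude=0.1):
    """Local phase folded onto [0, pi/2]; weak or border responses are set to 0."""
    response = np.convolve(np.asarray(profile, dtype=float), kernel, mode="same")
    phase = np.arctan2(np.abs(response.imag), np.abs(response.real))
    amplitude = np.abs(response)
    half = len(kernel) // 2
    amplitude[:half] = 0.0    # positions where the kernel overlaps the zero padding
    amplitude[-half:] = 0.0
    # In flat regions the response is numerical noise, so the phase is meaningless.
    phase[amplitude < min_rel_amplitude * amplitude.max()] = 0.0
    return phase

# Synthetic descending ramp (hypothetical gray values), centered at index 50.
positions = np.arange(101)
profile = np.clip(125.0 - 15.0 * (positions - 50), 50.0, 200.0)

phase = projected_local_phase(profile, gabor_1d(sigma=4.0, u=0.3, half_width=12))
edge_position = int(np.argmax(phase))   # estimated location of the ramp edge
```

For this idealized, perfectly antisymmetric ramp the projected phase reaches $\pi/2$ at the ramp center. Real gray value profiles are neither symmetric nor noise-free, which is why position and value of the maximum vary with the center frequency as described above.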




Fig. 2.59. Courses of local phases along the line segment of Figure 2.58; (Left)
Courses for frequency indices 5 (bold) and 7 (dotted); (Right) Courses for frequency
indices 10 (bold) and 14 (dotted).



Determining Useful Operators for Local Phase Computation 

A principled treatment is required to obtain a series of useful operators. We
systematically increase the Gabor center frequency and altogether apply a bank
of 25 operator versions (including indices 5, 7, 10, 14 from above) to the line
segment of Figure 2.58 (left). From each of the 25 courses of local phases the
position and value of the global maximum are determined. Figure 2.60
shows in the diagram on the left the course of the position of the global
maximum and in the diagram on the right the course of the value of the global
maximum for varying center frequencies. In the left diagram the actual
position of the edge is indicated by a horizontal line. Based on this, we
determine a maximal band width of center frequencies for which the estimated
edge position deviates from the actual one only to a certain extent. For
example, if accepting a deviation of plus/minus two pixels, then it is possible
to apply operator versions 0 up to 13. In the right diagram the desired local
phase $\pi/2$ is indicated by a horizontal line. Based on this, we determine a
maximal band width of center frequencies for which the estimated phase deviates
from the desired one only to a certain extent. For example, if accepting a
deviation of at most 20 percent, then it is possible to apply operator versions
0 up to 23. Based on these two constraints, the series of appropriate operator
versions is determined by interval intersection, i.e. resulting in operator
versions 0 up to 13. Based on the phase estimations from this subset of
relevant operators, we compute the mean value for the purpose of robustness,
e.g. resulting in this case in 1.45.
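
The two acceptance constraints and the averaging can be sketched as follows. The function name `admissible_operators` and its parameters are hypothetical. The sketch interprets the 20 percent tolerance as relative to the desired phase $\pi/2$, which is an assumption, and it combines the constraints as a boolean mask, so it does not by itself enforce that the accepted operator versions form one contiguous band.

```python
import numpy as np

def admissible_operators(max_positions, max_phases, true_position,
                         position_tol=2.0, desired_phase=np.pi / 2,
                         phase_tol=0.20):
    """Indices of operators satisfying both constraints, and their mean phase."""
    max_positions = np.asarray(max_positions, dtype=float)
    max_phases = np.asarray(max_phases, dtype=float)
    # Constraint 1: position of the global maximum close to the actual edge.
    position_ok = np.abs(max_positions - true_position) <= position_tol
    # Constraint 2: value of the global maximum close to the desired phase.
    phase_ok = np.abs(max_phases - desired_phase) <= phase_tol * desired_phase
    selected = np.flatnonzero(position_ok & phase_ok)
    mean_phase = float(max_phases[selected].mean()) if selected.size else float("nan")
    return selected, mean_phase

# Toy values (not measured data) for four operator versions.
selected, mean_phase = admissible_operators(
    max_positions=[50, 51, 53, 50],
    max_phases=[1.5, 1.4, 1.5, 1.0],
    true_position=50,
)
```

Here operators 0 and 1 satisfy both constraints, and the mean of their phase estimations is 1.45.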





Fig. 2.60. Courses of estimated edge position (left) and local phase (right) under 
varying center frequency of the operator, determined along the line segment for the 
image in Figure 2.58. 



In order to validate both the appropriateness of the operators and the 
mean value of phase estimation, we have to rotate the voltage controller and 
repeat the procedure again and again. For example, under a clockwise
rotation by 5° (relative to the previous case) we obtain appropriate operator
versions 6 up to 16, and the mean value of phase estimation is 1.49.
Alternatively, a counterclockwise rotation by 5° (relative to the first case) reveals
appropriate operator versions 9 up to 17, and the mean value of phase esti- 
mation is 1.45. As a result of the experimentation phase, it makes sense to 
determine those series of operators which are appropriate for all orientations. 
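
For the three orientations reported above, the operators appropriate for all of them follow from intersecting the three index intervals. The helper `intersect_intervals` is a hypothetical name used only for this small sketch.

```python
def intersect_intervals(intervals):
    """Intersection of closed integer intervals (low, high); None if empty."""
    low = max(lo for lo, _ in intervals)
    high = min(hi for _, hi in intervals)
    return (low, high) if low <= high else None

# Operator ranges for the first case and the two rotations by 5 degrees.
common = intersect_intervals([(0, 13), (6, 16), (9, 17)])
```

For these three cases the common range is operator versions 9 up to 13. With more orientations the common range can only shrink, and an empty intersection would indicate that no single series of operators suits all orientations.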

Determining Values for Threshold Parameters 

Based on the maximal accepted or actual estimated deviation from $\pi/2$ we
can determine an appropriate value for threshold $\delta_6$ (see Assumption 2.5),
which represents the degree of parallel/ramp phase compatibility. Furthermore,
based on the experiments we can specify the area around an edge
point in which to apply local phase computation. For example, a maximal
deviation of plus/minus 2 pixels has been observed, and actually this measure
can be used as the width of a stripe along an image line which has been
extracted by Hough transformation.

2.5.3 Learning Degree of Parallelism Compatibility 

The perspective projection of parallel 3D lines yields approximate parallel 
lines in the image. A basic principle of structural level grouping is the search 
for these approximate parallel lines (see Section 2.3). The degree of deviation 
from exact parallelism must be learned on the basis of visual demonstration. 
For this purpose we mount the task-relevant objective, put the camera at a
task-relevant place, and take images of a test object under varying rotation
angle. An elongated, rectangular paper is used as test object with the
object we extract a certain pair of object boundary lines, i.e. the pair of ap- 
proximate parallel lines which are the longest. Orientation-selective Hough 
transformation can be applied as basic procedure (see Section 2.2) and the 
relevant lines are determined by searching for the two highest peaks in the 
Hough image. From the two lines we take only the polar angles, compute 
the absolute difference and collect these measurements for the series of dis- 
crete object rotations. In the experiments, the rectangle has been rotated in 
discrete steps of 10° from approximately 0° to 180°. 
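
The measurement of deviation from parallelism can be sketched as follows. The function names are hypothetical. Folding the difference of polar angles at 180° (so that, for example, 179° and 2° count as 3° apart) and taking the maximum over the series as a candidate threshold are assumptions of this sketch; the text above only specifies that the absolute differences are collected.

```python
import numpy as np

def polar_angle_difference(theta1_deg, theta2_deg):
    """Absolute difference of two line orientations, folded into [0, 90] degrees."""
    d = np.abs(np.asarray(theta1_deg, dtype=float)
               - np.asarray(theta2_deg, dtype=float)) % 180.0
    return np.minimum(d, 180.0 - d)

def parallelism_deviation(angle_pairs_deg):
    """Deviations for a series of line pairs and their maximum."""
    pairs = np.asarray(angle_pairs_deg, dtype=float)
    deviations = polar_angle_difference(pairs[:, 0], pairs[:, 1])
    return float(deviations.max()), deviations

# Toy polar angles (not measured data) of the two highest Hough peaks per image.
max_dev, deviations = parallelism_deviation([(10, 12), (179, 2), (90, 95)])
```

In this toy series the deviations are 2°, 3° and 5°, so the largest observed deviation is 5°.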

Experiments on the Perspective Effects on Parallelism

In the first experiment, an objective with focal length 24 mm has been used,
and the distance between camera and object was about 1000 mm. Figure 2.61
(top) shows a subset of four images at the rotation angles 10°, 50°, 100°, 
140°, and therein the extracted pairs of approximate parallel lines. For the 
whole series of discrete object rotations the respective difference between the 
polar angles of the lines is shown in the diagram of Figure 2.62 (left). The 
difference between the polar angles of the lines varies between 0° and 5°, 
and the maximum is reached when the elongated object is collinear with the 
direction of the optical axis. 

To demonstrate the influence of different objectives we repeated the
experiment with an objective of focal length 6 mm, and a distance between
camera and object of about 300 mm. Figure 2.61 (bottom) shows a subset
of four images at the rotation angles 10°, 50°, 100°, 140°, and therein the 
extracted pairs of approximate parallel lines. For the whole series of discrete 
object rotations the respective difference between the polar angles of the lines 
is shown in the diagram of Figure 2.62 (right). The difference between the 
polar angles of the lines varies between 1° and 7°. By experimentation, we 
quantified the increased perspective distortion for objectives of small focal 
length. 

Fig. 2.61. Images taken under focal length 24 mm (top row) or taken under focal
length 6 mm (bottom row); rectangle object in four orientations from a series of 18,
and extracted pairs of approximate parallel lines.

Fig. 2.62. Based on images taken under focal length 24 mm (left diagram) or
taken under focal length 6 mm (right diagram); courses of deviations from exact
parallelism for the rectangle object under rotation.

Taking the task-relevant experiment into account, we can specify a degree of
compatibility for the geometric projection of parallel 3D lines. The
experimental data are used to supply values for the threshold parameters
$\delta_5, \delta_9, \delta_{10}, \delta_{11}$, introduced in Section 2.3.



2.6 Summary and Discussion of the Chapter 

The system for boundary extraction is organized in several generic proce- 
dures, for which the relevant definitions, assumptions, and realizations have 
been presented. In general, it works successfully for man-made objects of
approximate polyhedral shape. Interestingly, the various assumptions can be
organized into three groups by considering the level of generality.

• The first group introduces general geometric/photometric
compatibilities for polyhedral objects.

• The second group considers geometric compatibilities under pro- 
jective transformation which hold for a subset of viewpoints. 

• The third group incorporates more specific assumptions concerning 
object appearance and shape. 



Therefore, the three groups of assumptions are stratified according to 
decreasing generality, which imposes a certain level of speciality on the pro- 
cedures. 

Validity of the Assumptions of the First Group 

The assumptions of the first group should be valid for arbitrary polyhe- 
dral objects from which images are taken with usual camera objectives. This 
first group consists of the Assumptions 2.1, 2.2, 2.3, 2.5, 2.6, and 2.7, which 
are based on functions for evaluating geometric/photometric compatibilities. 
These compatibilities are considered between global geometric features on 
the one hand, e.g. line segment, line pencil, quadrangle, or an arbitrary poly- 
gon, and local photometric features on the other hand, e.g. edge orientation, 
corner characterization, local phase. Threshold parameters $\delta_1$,
$\delta_2$, $\delta_3$, $\delta_4$, $\delta_6$, $\delta_7$, $\delta_8$ are
involved for specifying the necessary geometric/photometric compatibility.

These criteria can be used for accepting just the relevant line structures 
in order to increase the efficiency of subsequent procedures for boundary ex- 
traction. According to our experience, the parameters can be determined in a 
training phase prior to the actual application phase. They mainly depend on 
the characteristics of image processing techniques involved and of the cam- 
era objectives used. For example, we must clarify in advance the accuracy 
of the orientation of gray value edges and the accuracy of the localization 
of gray value corners, and related to the process of image formation, we are 
interested in the field of sharpness and the distortion effects on straight lines. 
Based on these measurements, we compute the actual degrees of deviation 
from exact invariance and conclude about acceptance of compatibilities. If 
a compatibility exists then the relevant threshold parameters are specified. 
In the case of rejection, one must consider more appropriate image process- 
ing techniques and/or other camera objectives. According to this, the role 
of experimentation is to test the appropriateness of certain image processing
techniques in combination with certain camera objectives for extracting
certain image structures.

Validity of the Assumptions of the Second Group

The assumptions of the second group are supposed to be valid for arbitrary 
polyhedral objects but there is a restriction concerning the acceptable view 
conditions. This second group consists of the Assumptions 2.4, 2.8, 2.9, 2.10, 
2.12, 2.13. They impose constraints on the projective transformation of ge- 
ometric features of 3D object shapes. To consider the regularity aspect of 
man-made objects a set of collated regularity features is used, such as par- 
allel lines, right-angled lines, reflection-symmetric polylines, or translation- 
symmetric polylines. The object shapes are detected in the image as salient 
polygons or arrangements of polygons. Several saliency measures have been 
defined on the basis of geometric/photometric compatible features and the 
collated regularity features (just mentioned). 

It is essential that also the regularity features are compatible under pro- 
jective transformation. Regularity and compatibility depend on each other, 
e.g. the vanishing-point compatibility affects the necessary degree of devia- 
tion from parallelism. The degree of deviation from exact invariance depends 
on the spectrum of permissible camera positions relative to the scene
objects. Threshold parameters $\delta_5$, $\delta_9$, $\delta_{10}$,
$\delta_{11}$ are involved in the assumptions for describing the permissible
degrees of deviation from exact invariance. For example, if we would like to
locate the right-angled silhouette of a flat object (e.g. an electronic board)
then the camera must be oriented approximately perpendicular to this object,
and this will be considered in the parameter $\delta_{11}$ (see Assumption
2.10).

Validity of the Assumptions of the Third Group 

The basic assumption that the scene consists of approximate polyhedral ob- 
jects usually is too general for providing one and only one generic procedure 
for boundary extraction. Therefore, a third group of constraints is introduced 
consisting of the Assumptions 2.11, 2.14, 2.15. They impose constraints on 
the gray value appearance and the shape of the depicted objects. We must
examine whether an extracted polygon is an approximate representation of the
object silhouette, or examine whether the transitions between object faces
have high gray value contrast, or examine whether the shape of an object in
a quadrangle image window is an approximate parallelepiped. Threshold
parameters $\delta_{12}$ and $\delta_{13}$ are involved for quantifying these
constraints. Although the assumptions of this third group are more specific
than those of the other two groups, they are still somewhat general.

Gaussian Assumption Related to Feature Compatibilities 

In numerous experiments we observed that the distribution of deviations
from invariance is more or less Gaussian-like. That was the motivation for
comparing the concept of compatibility with the concept of invariance via
the Gaussian, i.e. invariance is a special case of compatibility with sigma
equal to 0. More generally, the deviations from the theoretical invariant can
be approximated by computing the covariance matrix and thus assuming a
multi-dimensional Gaussian which may be neither rotation-symmetric nor
in normal form. This more sophisticated approximation would be more
appropriate for deciding whether certain procedures are applicable, and for
determining appropriate parameters from the covariance matrix instead of a
scalar sigma.
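
The multi-dimensional variant can be sketched as follows. The function names and the use of the Mahalanobis distance as acceptance measure are assumptions made for illustration and are not a procedure specified in this chapter; the sample data are toy values.

```python
import numpy as np

def fit_deviation_model(deviations):
    """Mean vector and covariance matrix of observed deviations (one row per sample)."""
    deviations = np.asarray(deviations, dtype=float)
    mean = deviations.mean(axis=0)
    cov = np.cov(deviations, rowvar=False)   # unbiased estimate (divides by n - 1)
    return mean, cov

def mahalanobis(sample, mean, cov):
    """Distance of a deviation vector from the mean, measured in units of the Gaussian."""
    diff = np.asarray(sample, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Toy two-dimensional deviations, e.g. (orientation error, phase error).
mean, cov = fit_deviation_model([[1, 0], [-1, 0], [0, 2], [0, -2]])
d1 = mahalanobis([1, 0], mean, cov)
d2 = mahalanobis([0, 2], mean, cov)
```

In this toy example the second component has twice the standard deviation of the first, so the deviation vectors (1, 0) and (0, 2) obtain the same distance. A scalar sigma could not express such direction-dependent tolerances.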

Summary of the Approach of Boundary Localization 

Altogether, our approach succeeds in locating and extracting the boundary 
line configurations for approximate polyhedral objects in cluttered scenes. 
Following Occam’s minimalistic philosophy, the system makes use of funda- 
mental principles underlying the process of image formation, and makes use 
of general regularity constraints of man-made objects. Based on this, the role 
of specific object models is reduced. This aspect is useful in many realistic 
applications, for which it is costly or even impossible to acquire specific ob- 
ject models. For example, in the application area of robotic manipulation of 
electrical scrap (or car scrap, etc.), it is inconceivable and anyway unneces- 
sary to explicitly model all possible objects in detail. For robotic manipula- 
tion of the objects approximate polyhedral descriptions are sufficient, which 
can be extracted on the basis of general assumptions. The novelty of our 
methodology is that we maximally apply general principles and minimally 
use object-specific knowledge for extracting the necessary information from 
the image to solve a certain task. 

Future work should discover more compatibilities between geometry and 
photometry of image formation, and more compatible regularity features un- 
der projective transformation.® The combination of compatibility and reg- 
ularity must be treated thoroughly, e.g. solving the problem of combined 
constraint satisfaction. An extension of the methodology beyond man-made 
objects, e.g. natural objects such as faces, is desirable. 

The next chapter presents a generic approach for learning operators for 
object recognition which is based on constructing feature manifolds. Actually, 
a manifold approximates the collection of relevant views of an object, and 
therefore represents a kind of compatibility between object views. However, 
the representation of the compatibilities between various object views is more 
difficult compared to the compatibilities treated in this chapter. 



® In Subsection 4.3.5 we will introduce another type of compatibility under
projective transformations which will prove useful for treating the problem of
stereo matching.

This chapter presents a generic approach for learning operators for object
recognition or situation scoring, which is based on constructing feature
manifolds.



3.1 Introduction to the Chapter 

The introductory section of this chapter describes in a general context the 
central role of learning for object recognition, then presents a detailed review 
of relevant literature, and finally gives an outline of the following sections.^ 



3.1.1 General Context of the Chapter 

Famous physiologists (e.g. Hermann von Helmholtz) insisted on the central
role of learning in visual processes [46]. However, only a few journals
dedicated issues to this aspect (e.g. [5]), only a few workshops focused on
learning in Computer Vision (e.g. [19]), and finally, only a few doctoral
dissertations treated learning as the central process of artificial vision
approaches (e.g. [24]). We strongly believe that the paradigm of Robot Vision
must be arranged completely around learning processes at all levels of feature
extraction and object recognition. The inductive theory of vision proposed in
[65] is in agreement with our belief, i.e. the authors postulate that vision
processes obtain all the basic representations via inductive learning
processes. Contrary to the “school of D. Marr”, machine vision is only
successful by using both the input signal and, importantly, also learned
information. The authors continue with the following two statements which we
should emphasize.



“The hope is that the inductive processes embody the universal and 
efficient means for extracting and encoding the relevant information 
from the environment.” 



^ The learning of signal transformations is a fundamental characteristic of Robot 
Vision, see Section 1.2. 

J. Pauli: Learning-Based Robot Vision, LNCS 2048, pp. 101-169, 2001. 
“The evolution of intelligence could be seen, not as ad hoc, but as a 
result of interactions of such a learning mechanism with the 
environment.” 



In agreement with this, a main purpose of this work is to declare that all 
vision procedures, e.g. for signal transformation, feature extraction, or object 
recognition, must be based on information which has been adapted or learned 
in the actual environment (see Section 1.4). 

In Section 2.5 it was shown by example how to determine useful (intervals 
of) parameter values in order to apply procedures for boundary line 
extraction appropriately. Furthermore, for various types of compatibilities 
which hold under real image formation, the degrees of actual deviations from 
invariants have been determined. All of this is based on visual 
demonstrations in the task-relevant environment, and the necessary learning 
process simply consists of rules for closing intervals or for approximating 
distributions. In this chapter (neural) learning plays a more fundamental 
role. Operators must be learned constructively for the actual environment, 
because the compatibilities presented in Chapter 2 are not sufficient to 
solve certain tasks. 
For example, we would like to recognize certain objects in the scene in order 
to manipulate them specifically with the robot manipulator. The learning 
procedure for the operators is based on 2D appearance patterns of the rele- 
vant objects or response patterns resulting from specific filter operations. The 
main interest is to represent or approximate the pattern manifold such that 
an optimal compromise between efficiency, invariance and discriminability 
of object recognition is achieved. 

3.1.2 Approach for Object and Situation Recognition 

This work focuses on a holistic approach for object and situation recognition 
which treats appearance patterns, filtered patterns, or histogram patterns 
(vectors) and leaves local geometric features implicit. The recognition is based 
on functions which are not known a priori and therefore have to be learned in 
the task-relevant environment. It is essential to keep these functions as simple 
as possible, as the complexity is correlated to the time needed for object or 
situation recognition. Three aspects are considered in combination in order 
to meet this requirement. 

• Appearance patterns should be restricted to relevant windows, e.g. by tak- 
ing the silhouette boundary of objects into account (see Section 2.3) in 
order to suppress the neighborhood or background of a relevant object. 

• Only those appearance variations should be taken into account which 
inevitably occur in the process of task-solving, e.g. change of lighting 
condition, change of relation between object and camera, etc. 

• For the task-relevant variety of appearance patterns we are looking for 
types of representation such that the manifold dimension decreases, e.g. 
gray value normalization, band-pass Gabor filtering, log-polar 
transformation. 

Coarse-to-Fine Strategy of Learning 

Techniques for simplifying the pattern manifold are applied as a pre-process- 
ing step of the learning procedure. For learning the recognition function a 
coarse-to-fine strategy is favoured. The coarse part treats global and the fine 
part treats local aspects in the manifold of patterns. 

First, a small set of distinguished appearance patterns (seed patterns) 
is taken under varying imaging conditions, represented appropriately, and 
approximated by a learning procedure. Usually, these patterns are globally 
distributed in the manifold, and they serve as a canonical frame (CF) for 
recognition. The algebraic representation is an implicit function whose 
value is (approximately) 0, and in this sense the internal parameters of the 
function describe a global compatibility between the patterns. 
Computationally, the implicit function will be learned and represented by a 
network of Gaussian basis functions (GBF network) or alternatively by 
principal component analysis (PCA). 

Second, the global representation is refined by taking appearance patterns 
from counter situations into account, which actually leads to local specializa- 
tions of the general representation. The implicit function of object recognition 
is modified such that compatibility between views is conserved but with the 
additional characteristic of discriminating more reliably between various ob- 
jects. Furthermore, the recognition function can be adjusted more carefully 
in critical regions of the pattern manifold by taking and using additional 
images of the target object. For the local modification of the global mani- 
fold representation we apply once again Gaussian basis functions or principal 
component analysis. 

In summary, our approach for object and situation recognition uses 
recognition functions which are acquired and represented by mixtures of GBFs 
and PCAs. Based on this brief characterization of the methodology we will 
review relevant contributions in the literature. 

3.1.3 Detailed Review of Relevant Literature 

A work of Kulkarni et al. reviews classical and recent results in statistical 
pattern classification [92]. Among these are nearest neighbor classifiers, 
the closely related kernel classifiers, classification trees, and various types 
of neural networks. Furthermore, the Vapnik-Chervonenkis theory of learning 
is treated. Although the work gives an excellent survey, it cannot be 
complete, e.g. principal component analysis is not mentioned and dynamically 
growing neural networks are missing. 
Principal Component Analysis (PCA) 

Principal component analysis (PCA) is a classical statistical method which 
determines from a random vector population the system of orthogonal vec- 
tors, i.e. so-called principal components, such that the projection of the ran- 
dom vectors onto the first component yields largest variance, and the projec- 
tion onto the second component yields second largest variance, and so on [60, 
pp. 399-440]. The principal components and the variances are obtained by 
computing eigenvectors and eigenvalues of the covariance matrix constructed 
from the random vector population. For the purpose of classification, the 
eigenvectors with the largest eigenvalues are the most important ones, and 
class approximation takes place by simply omitting the eigenvectors with 
the lowest eigenvalues. The representation of a vector by (a subset of most 
important) eigenvectors, i.e. the eigenspace, is called the Karhunen-Loeve 
expansion (KLE) which is a linear mapping. 
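
The computation described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this work: it assumes a tiny two-dimensional sample set, uses plain power iteration instead of a full eigen-decomposition, and all function names (`covariance_matrix`, `leading_eigenpair`) are hypothetical.

```python
import math

def mean_vector(data):
    """Component-wise mean of a list of equally long vectors."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

def covariance_matrix(data):
    """Sample covariance matrix of the vector population."""
    mu = mean_vector(data)
    n, m = len(data), len(mu)
    centered = [[x[k] - mu[k] for k in range(m)] for x in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
            for a in range(m)]

def mat_vec(c, v):
    """Matrix-vector product for a square matrix given as nested lists."""
    return [sum(c[a][b] * v[b] for b in range(len(v))) for a in range(len(c))]

def leading_eigenpair(c, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    m = len(c)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = mat_vec(c, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(c, v)))
    return eigenvalue, v

# Hypothetical sample population scattered around the line y = 2x.
data = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]
cov = covariance_matrix(data)
eigenvalue, component = leading_eigenpair(cov)

# Share of the total variance captured by the first principal component.
total_variance = sum(cov[k][k] for k in range(len(cov)))
explained = eigenvalue / total_variance

# Projection of the centered vectors onto the first component.
mu = mean_vector(data)
scores = [sum((x[k] - mu[k]) * component[k] for k in range(len(mu)))
          for x in data]
```

For this population nearly all variance falls onto the first component, so omitting the second one loses very little information. Power iteration was chosen only to keep the sketch free of external libraries; it yields just the leading component.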

Turk and Pentland applied the PCA approach to face recognition and also 
considered problems with different head sizes, different head backgrounds and 
localization/tracking of a head [168]. Murase and Nayar used PCA for the 
recognition of objects which are rotated arbitrarily under different 
illumination conditions [112]. The most serious problem with PCA is the daring 
assumption of a multi-dimensional Gaussian distribution of the random vector 
population, which is not true in many realistic applications. Consequently, 
many approaches of nonlinear dimension reduction have been developed in 
which the input data are clustered and local PCA is executed for each 
cluster. A piecewise linear approximation is obtained which is globally 
nonlinear [89, 163]. In a work of Prakash and Murty the clustering step is 
more closely combined with PCA which is done iteratively on the basis of 
classification errors [133]. A work of Bruske and Sommer performs cluster- 
ing within small catchment areas such that the cluster centers can serve as 
reasonable representatives of the input data [30]. Based on this, local PCA 
is done between topologically neighbored center vectors instead of the larger 
sets of input vectors. 

Radial Basis Function Networks (RBF Networks) 

The construction of topology preserving maps (functions) from input spaces 
to output spaces is a central issue in neural network learning [106]. It was 
proven by Hornik et al. that multilayer feedforward networks with arbitrary 
squashing function, e.g. MLP (multilayer perceptron) networks, can approx- 
imate any Borel-measurable function to any desired degree of accuracy, pro- 
vided sufficiently many hidden units are available [81]. Similar results have 
been reported for so-called regularization networks, e.g. RBF (radial basis 
function) networks and the more general HBF (hyper basis function) net- 
works [63], which are theoretically grounded in the regularization theory. 
Besides the theoretical basis the main advantage of regularization networks 
is the transparency of what is going on, i.e. the meaning of nodes and links. 
A regularization network interpolates a transformation of m-dimensional 
vectors into p-dimensional vectors by a linear combination of n nonlinear basis 
functions. Each basis function, represented in a hidden node, operates as a lo- 
calized receptive field and therefore responds most strongly for input vectors 
localized in the neighborhood of the center of the field. 

For example, the spherical Gaussian is the most popular basis function 
which transforms the distance between an input vector and the center vector 
into a real value of the unit interval. Classically, the centers and the extent 
of the receptive fields are learned with unsupervised methods, while the fac- 
tors for combining the basis functions are learned in a supervised manner. 
Hyper basis functions are preferred for more complex receptive fields, e.g. 
hyper-ellipsoidal Gaussians for computing the Mahalanobis distance to the 
cluster center (based on local PCA). We will introduce the unique term GBF 
networks for specific regularization networks which consist of hyper-spherical 
or hyper-ellipsoidal Gaussians. Because of this transparency, GBF networks 
in particular are used in dynamic network architectures, i.e. the hidden 
nodes are constructed dynamically in a supervised or unsupervised learning 
methodology [59, 29]. Transparency has also been a driving force for the 
development of parallel consensual neural networks [18] and mixture-of-experts 
networks [174]. 

Support Vector Networks (SV Networks) 

Recently, the so-called support vector networks became popular whose con- 
struction is grounded on the principle of minimum description length [139], 
i.e. the classifier is represented by a minimum set of important vectors. The 
methodology of support vector machines consists of a generic learning ap- 
proach for solving nonlinear classification or regression problems. The pi- 
oneering work was done by Vapnik [171], an introductory tutorial is from 
Burges [31], and an exemplary application to object recognition has been 
reported in a work of Pontil and Verri [132]. The conceptual idea of constructing 
the classifier is to transform the input space nonlinearly into a feature space, 
and then to determine a linear decision boundary there. The optimal hyperplane 
is based only on a subset of feature vectors, the so-called support vectors (SV), 
which belong to the common margin of two classes, respectively. The relevant 
vectors and the parameters of the hyperplane can be determined by solving a 
quadratic optimization problem. As a result, the decision function is a linear 
combination of so-called kernel functions for which Mercer’s condition must 
be satisfied. 
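
To make the form of such a decision function concrete, the following sketch evaluates a linear combination of Gaussian kernels centered at given support vectors. It is an illustration under stated assumptions: the support vectors, coefficients, bias, and kernel width are hypothetical values chosen by hand, the kernel is written in the common form exp(-d²/(2·width²)), and the quadratic optimization that would normally determine these quantities is not performed.

```python
import math

def gaussian_kernel(x, s, width):
    """Spherical Gaussian kernel between input x and support vector s."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, s))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def decision_value(x, support_vectors, coefficients, bias, width):
    """Linear combination of kernel functions plus bias."""
    return sum(c * gaussian_kernel(x, s, width)
               for c, s in zip(coefficients, support_vectors)) + bias

def classify(x, support_vectors, coefficients, bias, width):
    """Sign of the decision value gives the class label (+1 or -1)."""
    value = decision_value(x, support_vectors, coefficients, bias, width)
    return 1 if value >= 0.0 else -1

# Hypothetical, hand-chosen quantities (not the result of training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coefficients = [1.0, -1.0]   # signed combination factors
bias = 0.0
width = 1.0

near_first = decision_value((0.2, 0.1), support_vectors, coefficients, bias, width)
near_second = decision_value((1.9, 2.1), support_vectors, coefficients, bias, width)
midpoint = decision_value((1.0, 1.0), support_vectors, coefficients, bias, width)
```

Inputs close to the first support vector receive a positive value, inputs close to the second a negative one, and the midpoint lies on the decision boundary.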

An example of a kernel function is the spherical Gaussian. Accordingly, 
the formula implemented by a support vector machine with Gaussian kernels 
is identical to the formula implemented by an RBF network [160]. The distinc- 
tion between an RBF network and a support vector machine with Gaussian 
kernels is by the technique of constructing the unknowns, i.e. the center vec- 
tors, the extensions and the combination factors of the Gaussians. In a work 
of Schölkopf et al. the classical approach of training an RBF network 
(unsupervised learning of Gaussian centers, followed by supervised learning of 
combination factors) is compared with the support vector principle of deter- 
mining the unknowns, and as a result, the SV approach proved to be superior 
[151].^ 

Bias/Variance Dilemma in SV Networks 

Support vector machines enable the treatment of the bias/variance dilemma 
[62] by considering constraints in the construction of the support vectors. In 
order to understand the principle we regard a class as a set of local transfor- 
mations which have no influence on the class membership. The PhD thesis of 
Schölkopf presents two approaches within the support vector paradigm which 
consider and make use of this aspect [152, pp. 99-123]. In the first approach, 
the support vector machine is trained to extract the set of support vectors, 
then further virtual examples are generated in a localized region around the 
support vectors, and finally, a new support vector machine is learned from 
the virtual data. The second approach is similar to the first one, however, 
the learning procedure is not repeated with virtual data, but a regularization 
term is considered which is based on a known one-parameter group of local 
transformations, e.g. the tangent vectors at the set of support vectors.^ In ad- 
dition to these two approaches, which consider local transformations, a third 
approach is conceivable for certain applications. It is characterized by a global 
transformation of the input data into a new coordinate system in which class 
membership can be determined easily. For example, by taking the Lie group 
theory into account it is possible to construct kernels of integral transforms 
which have fine invariance characteristics [154]. The role of integral trans- 
forms for eliciting invariances and representing manifolds can be exemplified 
by the Fourier transform, e.g. the spectrum of a Fourier-transformed image 
does not change when translating a pattern in the original image. 
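
The translation invariance of the Fourier amplitude spectrum can be checked numerically. The sketch below is a one-dimensional illustration under a simplifying assumption: the translation is cyclic, i.e. the pattern wraps around the signal border, which corresponds to a pattern that stays completely inside the image on a uniform background. The helper names are hypothetical.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def amplitude_spectrum(signal):
    """Magnitudes of the Fourier coefficients."""
    return [abs(c) for c in dft(signal)]

def cyclic_shift(signal, s):
    """Translate the signal by s positions with wrap-around."""
    n = len(signal)
    return [signal[(t - s) % n] for t in range(n)]

pattern = [0, 0, 1, 3, 2, 0, 0, 0]
shifted = cyclic_shift(pattern, 3)

# Largest difference between the two amplitude spectra.
max_diff = max(abs(a - b) for a, b in
               zip(amplitude_spectrum(pattern), amplitude_spectrum(shifted)))
```

The translation changes only the phases of the Fourier coefficients, so `max_diff` vanishes up to rounding errors.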

The three approaches of enhancing the learning procedure are summarized 
in Figure 3.1 (adopted from the PhD thesis of Schölkopf [152, p. 101]). 
For simplicity a two-dimensional space is assumed, in which the dashed line 
depicts the desired decision boundary, and the black and gray disks represent 
elements from two different classes. The first approach (on the left) consid- 
ers virtual examples around the actual examples, the second approach (in 

^ However, the comparison seems to be unfair, because one can alternatively switch 
to another RBF learning approach which determines not only the Gaussian com- 
bination factors but also the Gaussian center vectors by error backpropagation 
([21, pp. 164-193], [122]). Furthermore, and at least equally important, the 
comparison considers only spherical Gaussians. However, a network of hyper-ellipsoidal 
Gaussians, i.e. a GBF network with each Gaussian determined by local PCA, 
usually does a better job in classification tasks [128, 69]. 

® A work of Burges treats the geometry of local transformations and the incorpora- 
tion of known local invariances into a support vector machine from a theoretical 
point of view [32]. 
the middle) uses tangent values around the examples, and the global 
transformation approach (on the right) is visualized by large circles through the 
examples which indicate that possibly only the radius is of interest for clas- 
sification. 




Fig. 3.1. (Left) Enhanced learning by virtual examples; (Middle) Tangent vectors; 
(Right) Modified representation. 



Classical Invariance Concepts for Recognition Tasks 

Although the concept of invariance plays a significant role in recognition, we 
only give a short survey. The reason is that the known approaches are only of 
limited use for our purpose. Mundy and Zisserman reviewed geometric invari- 
ants under transformation groups and summarized methods for constructing 
them [111]. However, methods such as the elimination of transformation pa- 
rameters are not adequate for hard recognition tasks, because we do not 
know the transformation formula and cannot assume group characteristics. 
Alternatively, the review given by Wechsler is more extensive, as it includes 
geometric, statistical, and algebraic invariances [175, pp. 95-160]. Related 
to appearance-based object recognition the aspect of statistical invariance is 
treated by principal component analysis. 

Evaluating Learning Procedures and Learned Constructs 

The usefulness and performance of all these learning procedures should be 
assessed on the basis of objective criteria in order to choose the one which 
is most appropriate for the actual task. However, a formal proof that a 
certain learning algorithm is superior to other ones usually cannot be 
given. The goal of theoretical work on learnability is to provide answers 
to the following questions. 



What problems can be learned, how much training data is required, 
how many classification errors will occur, what are the computational 
costs of applying the learning algorithm? 
The most popular formalization of these questions is from Vapnik and 
Chervonenkis who introduced the so-called PAC-learnability, with PAC an 
acronym of “probably approximately correct” [170]. A function is called 
PAC-learnable if and only if a learning algorithm can be formulated which produces 
with a certain expenditure of work a probably approximately correct 
function. Based on pre-specified levels for probability and correctness of a function 
approximation, we are interested in discovering the simplest and most efficiently 
applicable representation (Occam’s Razor). For example, authors of the 
neural network community derive lower and upper bounds on the sample size 
versus net size needed such that a function approximation of a certain qual- 
ity can be expected [15, 6]. On account of applying Occam’s Razor to GBF 
networks, it is desirable to discover the minimum number of basis functions 
in order to reach a critical quality for the function approximation. 

3.1.4 Outline of the Sections in the Chapter 

Section 3.2 describes the learning of functions for object recognition as the 
construction of pattern manifolds. Two approximation schemes are compared, 
i.e. networks of Gaussian basis functions and principal component analysis. 
In Section 3.3 GBF networks are applied for globally representing a manifold 
of object appearances under varying viewing conditions or a manifold of 
varying grasping situations. Based on visual demonstration, the number and 
the extent of the Gaussians are modified appropriately in order to obtain 
recognition functions with a certain quality. Section 3.4 presents alternative 
representations of appearance manifolds. These include a fine-tuning of the 
Gaussians by considering topologically neighbored patterns, or alternatively 
a global PCA instead of a GBF network. Furthermore, the global manifold 
representation will be refined by taking appearance patterns from counter 
situations into account using GBFs once again. Section 3.5 discusses the 
approaches of the preceding sections. 



3.2 Learning Pattern Manifolds with GBFs and PCA 

This section describes the learning of functions for object recognition as the 
construction of pattern manifolds. Robustness of recognition is formulated in 
the PAC terminology. Two approximation schemes are compared, i.e. 
networks of Gaussian basis functions and principal component analysis. 

3.2.1 Compatibility and Discriminability for Recognition 

Usually, invariants are constructed for groups of transformations. The more 
general the transformations the more difficult to extract invariants of the 
whole transformation group. Even if it is possible to extract invariants for a 
general transformation group, they are of little use in practice due to their 
generality. For an object under different view conditions the manifold of 
appearance patterns is complex in nature, i.e. the patterns are related by 
general transformations. Instead of determining invariants for a geometric 
transformation group, we are interested in compatibilities among object appearances. The 
compatibilities should be as specific as necessary for discriminating a target 
object from other objects. The actual task and the relevant environment 
are fundamental for constructing an appropriate recognition function. First, 
as a preprocessing step the pattern manifold should be simplified by image 
normalization, image filtering, and representation changing. This strategy 
is applied several times throughout the sections in this and the next chap- 
ter. Second, only the relevant subset of transformations should be considered 
which must be learned on the basis of visual demonstration. Formally, the 
learning step can be treated as parameter estimation for implicit functions. 

Invariance Involved in Implicit Functions 

Let $f^{im}$ be an implicit function with parameter vector B and input-output 
vector Z, such that 

$f^{im}(B, Z) = 0$ (3.1) 

The parameter vector B characterizes the specific version of the function $f^{im}$ 
which is of a certain type. The vector Z is called the input-output vector in 
order to express that input and output of an explicitly defined function are 
collected for the implicitly defined function. Input-output vector Z is taken 
as variable and we are interested in the manifold of all realizations for which 
equation (3.1) holds. In order to introduce the term compatibility into the 
context of manifolds, we may say, that all realizations in the manifold are 
compatible to each other. 

For example, for all points of a two-dimensional ellipse (in normal form) 
the equation (3.1) holds if 

$f^{im}(B, Z) := \frac{x_1^2}{b_1^2} + \frac{x_2^2}{b_2^2} - 1$, (3.2) 

with parameter vector $B := (b_1, b_2)^T$ containing specifically the two 
half-lengths $b_1$ and $b_2$ of the ellipse axes in normal form, and vector 
$Z := (x_1, x_2)^T$ containing the 2D coordinates. In terms of the Lie group 
theory of invariance [125], the manifold of realizations of Z is the orbit of 
a differential operator which is generating the ellipse. Function $f^{im}$ is 
constant for all points of the 
elliptical orbit, and therefore the half-lengths of the ellipse axes are invariant 
features of the responsible generator. 
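
The ellipse example can be verified with a short sketch. The half-lengths and the number of sample points are hypothetical values; the function name `f_im` merely mirrors the implicit function of the example.

```python
import math

def f_im(b, z):
    """Implicit ellipse function: x1^2/b1^2 + x2^2/b2^2 - 1."""
    b1, b2 = b
    x1, x2 = z
    return (x1 / b1) ** 2 + (x2 / b2) ** 2 - 1.0

def ellipse_points(b, n):
    """Sample n points on the ellipse with half-lengths b = (b1, b2)."""
    b1, b2 = b
    return [(b1 * math.cos(2.0 * math.pi * k / n),
             b2 * math.sin(2.0 * math.pi * k / n)) for k in range(n)]

B = (3.0, 2.0)                       # hypothetical half-lengths
on_orbit = ellipse_points(B, 12)

# The implicit function is (numerically) zero everywhere on the orbit ...
max_residual = max(abs(f_im(B, z)) for z in on_orbit)

# ... and non-zero for a point that does not belong to the ellipse.
off_orbit_value = f_im(B, (3.0, 2.0))
```

All sampled realizations of Z yield the same function value, which is the sense in which they are compatible to each other, whereas the point (3, 2) lies outside the ellipse and yields the value 1.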

Implicit Functions for Object or Situation Recognition 

For the purpose of object or situation recognition the function $f^{im}$ plays the 
role of a recognition function and therefore is much more complicated than 
in the previous example. The variation of the real world environment, e.g. 
object translation and/or rotation, background and/or illumination change, 
causes transformations of the appearance pattern, whose task-relevant orbit 
must be approximated. In the next two sections we present two approaches for 
approximating the recognition function, i.e. networks of Gaussian basis 
functions (GBF approach) and principal component analysis (PCA approach). In 
the GBF approach the parameter vector B comprises positions, extents, and 
combination factors of the Gaussians. In the PCA approach the parameter 
vector B comprises the eigenvectors and eigenvalues. The input-output 
vector Z is separated into the input component X and the output component Y, 
with the input part representing appearance patterns, filtered patterns, or 
histogram patterns of objects or situations, and the output part representing 
class labels or scoring values. 

The Role of Inequations in Implicit Recognition Functions 

In order to apply either the GBF approach or the PCA approach, the implicit 
function $f^{im}$ must be learned in advance such that equation (3.1) holds more 
or less for patterns of the target object and clearly does not hold for patterns 
of counter situations. Only small deviations from the ideal orbit are accepted 
for target patterns, and large deviations are expected for counter patterns. 
The degree of deviation can be controlled by a parameter $\psi$. 

$|f^{im}(B, Z)| < \psi$ (3.3) 

The function $f^{im}$ can be squared and transformed by an exponential function 
in order to obtain a value in the unit interval. 

$f^{Gi}(B, Z) := \exp\left(-f^{im}(B, Z)^2\right)$ (3.4) 

If function $f^{Gi}$ yields value 0, then vector Z is infinitely far away from the 
orbit; if it yields value 1, then vector Z belongs to the orbit. 
Equation (3.1) can be replaced equivalently by 

$f^{Gi}(B, Z) = 1$ (3.5) 

For reasons of consistency, we also use the exponential function to transform 
parameter $\psi$ into $\zeta$. Parameter $\psi$ was a threshold for distances, but 
parameter $\zeta$ is a threshold for proximities. 

$\zeta := \exp\left(-\psi^2\right)$ (3.6) 

With these transformations, we can replace equation (3.3) equivalently by 

$f^{Gi}(B, Z) > \zeta$ (3.7) 
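
The equivalence between the distance threshold and the proximity threshold can be illustrated as follows. This is a minimal sketch with hypothetical deviation values; the names `psi` and `zeta` stand for the distance threshold and the proximity threshold, respectively.

```python
import math

def proximity(f_im_value):
    """Map a deviation from the orbit to a proximity in the unit interval."""
    return math.exp(-f_im_value ** 2)

def is_compatible_by_distance(f_im_value, psi):
    """Threshold test on the absolute deviation."""
    return abs(f_im_value) < psi

def is_compatible_by_proximity(f_im_value, zeta):
    """Threshold test on the transformed value."""
    return proximity(f_im_value) > zeta

psi = 0.5                       # hypothetical distance threshold
zeta = math.exp(-psi ** 2)      # corresponding proximity threshold

# Hypothetical values of the implicit function for several patterns.
deviations = [0.0, 0.1, -0.3, 0.49, 0.51, -2.0]
by_distance = [is_compatible_by_distance(d, psi) for d in deviations]
by_proximity = [is_compatible_by_proximity(d, zeta) for d in deviations]
```

Because the exponential of the negated square decreases strictly with the absolute deviation, both tests accept exactly the same patterns.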

Based on these definitions, we extend the usage of the term compatibility, 
and embed it in the concept of manifolds. 







Assumption 3.1 (Manifold compatibility) Let function $f^{Gi}$ and parameter 
vector B be defined as above, and let threshold $\zeta$ be the minimal proximity 
to the ideal orbit of $f^{im}$. Then all realizations of Z for which equation (3.1) 
holds are compatible to each other. 

In the application of object or situation recognition, we must embed the 
compatibility criterion into the requirement for a robust operator. The ro- 
bustness of recognition is defined by incorporating an invariance criterion and 
a discriminability criterion. The invariance criterion strives for an operator 
which responds nearly equally for any pattern of the target object, i.e. the 
compatibility criterion with threshold $\zeta$ near 1. The discriminability criterion 
aims at an operator, which clearly discriminates between the target object 
and any other object or situation. Regions of the appearance space, which 
represent views of objects other than the target object or any background 
area, should be given low confidence values. The degree of robustness of an 
operator for object or situation recognition can be specified more formally in 
the PAC terminology. 

Definition 3.1 (PAC-recognition) Let parameter $\zeta$ define a threshold for 
a certain proximity to the ideal orbit. A function $f^{Gi}$ for object or situation 
recognition is said to be PAC-learned subject to the parameters $P_r$ and $\zeta$, 
if with a probability of at least $P_r$, the function value of a target pattern 
surpasses and the function value of a counter pattern falls below threshold $\zeta$. 
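
In practice the probability in this definition can only be estimated from validation patterns. The sketch below shows one possible empirical check, assuming that the relative frequency of correctly placed patterns serves as an estimate of the probability. The names `zeta` (proximity threshold) and `p_r` (required probability level), as well as all numerical values, are hypothetical.

```python
def pac_recognition_rate(target_values, counter_values, zeta):
    """Fraction of validation patterns on the correct side of zeta."""
    hits = sum(1 for v in target_values if v > zeta)
    hits += sum(1 for v in counter_values if v < zeta)
    return hits / (len(target_values) + len(counter_values))

def is_pac_learned(target_values, counter_values, zeta, p_r):
    """Empirical check: is the estimated probability at least p_r?"""
    return pac_recognition_rate(target_values, counter_values, zeta) >= p_r

# Hypothetical outputs of a recognition function for validation patterns.
target_values = [0.95, 0.91, 0.88, 0.97, 0.62]    # views of the target object
counter_values = [0.10, 0.35, 0.05, 0.81, 0.20]   # counter situations

rate = pac_recognition_rate(target_values, counter_values, zeta=0.8)
accepted = is_pac_learned(target_values, counter_values, zeta=0.8, p_r=0.8)
rejected = is_pac_learned(target_values, counter_values, zeta=0.8, p_r=0.9)
```

One target pattern falls below and one counter pattern surpasses the threshold, so 8 of 10 patterns are placed correctly. The function would pass the check for a required level of 0.8 but fail for 0.9.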

In several subsections throughout this chapter, we will present refinements 
and applications of this generic definition of PAC-recognition. Related to the 
problem of learning operators for object or situation recognition, a main 
purpose of visual demonstration is to analyze the conflict between invariance 
and discriminability and find an acceptable compromise. 



3.2.2 Regularization Principles and GBF Networks 

An approach for function approximation is needed which has to be grounded 
on sample data of the input-output relation. The function approximation 
should fit the sample data to meet closeness constraints and should generalize 
over the sample data to meet smoothness constraints. Neglecting the aspect 
of generalizing leads to overfitted functions; conversely, neglecting the fitting 
aspect leads to overgeneralized functions. Hence, both aspects have to be 
combined to obtain a qualified function approximation. The regularization 
approach incorporates both constraints and determines such a function by 
minimizing a functional [130]. 

Theoretical Background of GBF Networks 

Let $\Omega_{GBF} := \{(X_j, y_j) \mid (X_j, y_j) \in (\mathcal{R}^m \times \mathcal{R});\ j = 1, \cdots, J\}$ 
be the sample data representing the input-output relation of a function f that 
we want to approximate. The symbol $\mathcal{R}$ designates the set of real 
numbers. The functional F in equation (3.8) consists of a closeness term and a 
smoothness term, which are combined by a factor $\mu$ expressing the relative 
importance of each. 

$F(f) := \sum_{j=1}^{J} \left(y_j - f(X_j)\right)^2 + \mu \cdot \| F^{sm}(f) \|^2$ (3.8) 

The first term computes the sum of squared distances between the desired 
and the actual outcome of the function. The second term incorporates a 
differential operator $F^{sm}$ for representing the smoothness of the function. 

Under some pragmatic conditions (see again [130]) the solution of the 
regularization functional is given by equation (3.9). 



$f(X) := \sum_{j=1}^{J} w_j \cdot f_j^{Gs}(X)$ (3.9) 

The basis functions $f_j^{Gs}$ are Gaussians with $j \in \{1, \cdots, J\}$, specified for a 
limited range of definition, and having $X_j$ as the centers. Based on the 
non-shifted Gaussian basis function $f^{Gs}$ we obtain the J versions $f_j^{Gs}$ by shifting 
the center of definition through the input space to the places $X_1, \cdots, X_J$. The 
solution of the regularization problem is a linear combination of Gaussian 
basis functions (GBF) (see equation (3.9)). 



Constructing the Set of GBFs 



The number of GBFs need not be equal to the number of samples in $\Omega_{GBF}$. 
It is of interest to discover the minimum number of GBFs which are needed 
to reach a critical quality for the function approximation. Instead of using 
the vectors $X_1, \cdots, X_J$ for defining GBFs, we cluster them into I sets (with 
I < J), striving simultaneously for minimizing the variances within and 
maximizing the distances between the sets. A procedure similar to the error-based 
ISODATA clustering algorithm can be used which results in I sets (see 
Subsection 2.3.1). From each set a mean vector $X_i^c$, $i \in \{1, \cdots, I\}$, is selected 
(or computed) which specifies the center of the definition range of a GBF in 
normal form. 



$f_i^{Gs}(X) := \exp\left(-\frac{\| X - X_i^c \|^2}{\sigma_i^2 \cdot \tau}\right)$ (3.10) 

The function $f_i^{Gs}$ computes a similarity value between the vector $X_i^c$ and a 
new vector X. The similarity is affected by the pre-specified parameters $\sigma_i$ 
and $\tau$, whose multiplicative combination determines the extent of the GBF. 
Parameter $\sigma_i$ is defined by the variance of elements in the relevant cluster, 
averaged over all components of the m-dimensional input vectors. It is 
intuitively clear that the ranges of definition of the functions $f_i^{Gs}$ must overlap 
to a certain degree in order to approximate the recognition function appropriately. 
This overlap between the GBFs can be controlled by the factor $\tau$. 
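
The construction of spherical GBFs from clustered training vectors can be sketched as follows. Several assumptions are made and should be read as such: plain k-means is used as a simplified stand-in for the ISODATA-like procedure, the extent of a GBF is taken to be the product of the averaged cluster variance and the overlap factor (named `tau` here), and the training vectors and all function names are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(vectors, initial_centers, iterations=20):
    """Plain k-means; a simplified stand-in, not the ISODATA-like procedure."""
    centers = [list(c) for c in initial_centers]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(x, centers[i]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean_vector(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def averaged_variance(cluster, center):
    """Variance of the cluster elements, averaged over all components."""
    total = sum(squared_distance(x, center) for x in cluster)
    return total / (len(cluster) * len(center))

def spherical_gbf(x, center, variance, tau):
    """Similarity in (0, 1]; the extent is assumed to be variance * tau."""
    return math.exp(-squared_distance(x, center) / (variance * tau))

# Hypothetical 2D training vectors forming two groups.
vectors = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0),
           (9.0, 9.0), (11.0, 9.0), (9.0, 11.0), (11.0, 11.0)]
centers, clusters = k_means(vectors, initial_centers=[(0.0, 0.0), (11.0, 11.0)])
variances = [averaged_variance(c, m) for c, m in zip(clusters, centers)]

# The same probe vector evaluated with a small and a large overlap factor.
probe = (2.0, 1.0)
narrow = spherical_gbf(probe, centers[0], variances[0], tau=1.0)
wide = spherical_gbf(probe, centers[0], variances[0], tau=4.0)
```

A larger overlap factor widens the Gaussian, so the probe vector obtains a higher similarity value and neighboring GBFs overlap more strongly.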
As an alternative to spherical Gaussians, a GBF can be generalized by taking 
the covariance matrix $C_i$ into account, which describes the distribution of 
cluster elements around the center. 

$f_i^{Gs}(X) := \exp\left(-\frac{1}{\tau} \cdot (X - X_i^c)^T \cdot C_i^{-1} \cdot (X - X_i^c)\right)$ (3.11) 
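
A hyper-ellipsoidal GBF can be sketched for the two-dimensional case, where the covariance matrix can be inverted in closed form. The sketch assumes that the exponent is the quadratic form with the inverse covariance matrix, scaled by an overlap factor `tau`; the cluster elements and all function names are hypothetical.

```python
import math

def covariance_2d(cluster, center):
    """Covariance matrix of 2D cluster elements around the given center."""
    n = len(cluster)
    dx = [x[0] - center[0] for x in cluster]
    dy = [x[1] - center[1] for x in cluster]
    sxx = sum(a * a for a in dx) / n
    syy = sum(b * b for b in dy) / n
    sxy = sum(a * b for a, b in zip(dx, dy)) / n
    return [[sxx, sxy], [sxy, syy]]

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def ellipsoidal_gbf(x, center, cov_inv, tau=1.0):
    """Gaussian of the quadratic form (x - c)^T C^-1 (x - c), scaled by tau."""
    d = [x[0] - center[0], x[1] - center[1]]
    q = sum(d[r] * cov_inv[r][s] * d[s] for r in range(2) for s in range(2))
    return math.exp(-q / tau)

# Hypothetical cluster, elongated along the first coordinate axis.
cluster = [(-4.0, 0.0), (4.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
center = (0.0, 0.0)
cov_inv = inverse_2x2(covariance_2d(cluster, center))

# Two probe vectors with the same Euclidean distance to the center.
along_axis = ellipsoidal_gbf((2.0, 0.0), center, cov_inv)
across_axis = ellipsoidal_gbf((0.0, 2.0), center, cov_inv)
```

Although both probe vectors are equally far from the center in the Euclidean sense, the one along the elongated direction of the cluster receives a much higher similarity value.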

Determining the Combination Factors of the GBFs 

The linear combination of GBFs (reduced set) is defined by the factors $w_i$. 

$f(X) := \sum_{i=1}^{I} w_i \cdot f_i^{Gs}(X)$ (3.12) 

The approach for determining appropriate combination factors is as follows.
First, the $I$ basis functions are applied to the $J$ vectors $X_j$ of the
training set. This results in a matrix $V$ of similarity values with $J$ rows and
$I$ columns. Second, we define a $J$-dimensional vector $Y$ comprising the desired
output values $y_1, \dots, y_J$ for the $J$ training vectors. Third, we define a
vector $W$, which comprises the unknown combination factors $w_1, \dots, w_I$ of
the basis functions. Finally, the problem is to solve the equation $V \cdot W = Y$
for the vector $W$. According to Press et al. [134, pp. 671-675], we compute
the pseudo inverse of $V$ and determine the optimal vector $W$ directly.

$$V^{\dagger} := (V^T \cdot V)^{-1} \cdot V^T, \qquad W := V^{\dagger} \cdot Y \qquad (3.13)$$

The sample data $\Omega_{GBF}$ have been defined previously as a set of elements,
each one consisting of an input vector and an output scalar. GBF network learning
can be generalized in the sense of treating also output vectors instead of
output scalars. For this case, we simply compute a specific set of combination
factors of the GBFs for each output dimension.
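The determination of the combination factors can be sketched as follows, assuming spherical GBFs and using NumPy's `np.linalg.pinv` for the pseudo inverse. The helper names `similarity_matrix` and `fit_weights` are hypothetical.

```python
import numpy as np

def similarity_matrix(samples, centers, sigma2, tau):
    """Matrix V (J rows, I columns) of spherical GBF responses.

    sigma2 may be a scalar or an array with one variance per GBF.
    """
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (np.asarray(sigma2, dtype=float) * tau))

def fit_weights(V, Y):
    """Least-squares solution of V W = Y via the pseudo inverse.

    Y may be a vector (J,) or a matrix (J, K); in the latter case each
    column of the result holds the factors for one output dimension.
    """
    return np.linalg.pinv(V) @ np.asarray(Y, dtype=float)

# Four one-dimensional training vectors, two GBFs.
V = similarity_matrix([[0.0], [1.0], [2.0], [3.0]], [[0.5], [2.5]], 1.0, 1.0)
true_w = np.array([2.0, -1.0])
w = fit_weights(V, V @ true_w)                      # scalar outputs
true_w2 = np.array([[2.0, 0.0], [-1.0, 3.0]])
w2 = fit_weights(V, V @ true_w2)                    # two output dimensions
```

If $V$ has full column rank, `np.linalg.pinv(V)` coincides with $(V^T V)^{-1} V^T$; it is computed by singular value decomposition, which tends to be more robust when GBFs overlap strongly.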

Functionality of GBF Networks 

The use of GBF networks allows a nonlinear dimension reduction from input 
to output vectors (because of the nonlinearity of the Gaussians). Equations 
(3.11) and (3.12) define an approximation scheme which can be used for rele- 
vant functions of object or situation recognition. The approximation scheme 
is popular in the neural network literature under the term regularization neu- 
ral network [21, pp. 164-191], and we call it GBF network to emphasize the 
Gaussians. The approach does not assume a normal density distribution of
the whole population of input vectors. However, it is assumed that the density
distribution of input vectors can be approximated by a combination of several
normally distributed subpopulations. Actually, this is the motivation for using
Gaussians as basis functions in the network. Each GBF must be responsible
for a normally distributed subpopulation of input vectors. GBF network learning
helps to overcome the serious bias problem in high-level machine learning
[169] and parameter estimation [134]. The dynamic structure of the network,
to be changed and controlled on the basis of error feedback [29], lets the
learning method go beyond pure function approximation.

Visualization of GBF Networks 

A GBF network consists of an input layer, a layer of hidden nodes and an 
output layer. The input layer and output layer represent the input and output 
of the function approximation, the nodes of the hidden layer are assigned to 
the GBFs (see Figure 3.2). 




Representing Recognition Functions by GBF Networks 

The GBF network defines function $\tilde{f}$ which is an approximation of the unknown
function $f$. For evaluating the accuracy of the approximation we must apply
testing data consisting of new input vectors and accompanying desired output
values. The deviation between computed output and desired output can be
formalized by defining a function $f^{im}$ which plays the role of an implicit
function according to equation (3.1).

$$f^{im}(B, Z) := \tilde{f}(X) - y \qquad (3.14)$$

Vector $B$ consists of all parameters and combination factors of the Gaussians,
and vector $Z$ consists of input vector $X$ and desired output value $y$. We
obtain a PAC approximation subject to the parameters $P_r$ and $\psi$, if with a
probability of at least $P_r$ the magnitude of $f^{im}(B, Z)$ is less than $\psi$.

In the following, the paradigm of learning GBF networks is applied for
the purpose of object recognition. We assume that an individual GBF network
is responsible for each individual object. Concretely, each GBF network
must approximate the appearance manifold of the individual object. Based
on a training set of appearance patterns of an individual object, we obtain
the GBF network which defines function $\tilde{f}$. Without loss of generality,
function $\tilde{f}$ can be learned such that each appearance pattern of the object is
transformed to the value 1, approximately. Equation (3.14) is specialized to

$$f^{im}(B, Z) := \tilde{f}(X) - 1 \qquad (3.15)$$

By applying equation (3.4) we obtain function $f^{Gi}(B, Z)$ whose values are
restricted to the unit interval. The closer the value is to the upper bound
(value 1), the more reliably a pattern belongs to the object. A separate GBF
network has been trained for each individual object. By putting a new
appearance pattern into all GBF networks, we are able to discriminate between
different objects based on the maximum response among the networks.
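The discrimination by maximum response can be sketched as follows. The function `recognize` and the two toy networks are hypothetical; each network is assumed to be a callable that maps a pattern to a confidence value in the unit interval.

```python
import numpy as np

def recognize(pattern, networks):
    """Return (label, confidence) of the network with the maximum response.

    networks: dict mapping an object label to a callable that returns a
    confidence value in [0, 1] for the given pattern.
    """
    responses = {label: float(net(pattern)) for label, net in networks.items()}
    best = max(responses, key=responses.get)
    return best, responses[best]

def make_toy_network(prototype):
    """Toy stand-in for a trained network: one Gaussian around a prototype."""
    prototype = np.asarray(prototype, dtype=float)
    return lambda p: np.exp(-np.sum((np.asarray(p, dtype=float) - prototype) ** 2))

networks = {"cup": make_toy_network([0.0, 0.0]),
            "box": make_toy_network([3.0, 3.0])}
label, confidence = recognize([0.2, 0.1], networks)
```

In practice one would also require the maximum response to surpass a threshold, since a pattern that belongs to none of the objects still produces some largest response.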

Visualization of GBF Networks Applied to Object Recognition 

Let us assume three patterns from a target object (maybe taken under different
viewing angles) designated by $X_1, X_2, X_3$, which are represented in the
high-dimensional input (pattern) space as points. In the left diagram of
Figure 3.3 these points are visualized as black disks. We define three spherical
GBFs centered at these points and make a summation. The result looks like
an undulating landscape with three hills, in which the course of the ridge
goes across the hills. Seen from the top view of the landscape, the ridge between
the hills can be approximated by three straight lines. In the right diagram
of Figure 3.3 the result of function $f^{Gi}$ is shown when moving along points
of the ridge. The function is defined in equation (3.4) and is based on
equation (3.15) specifically.




Fig. 3.3. (Left) Input space with three particular points which are positions of
three Gaussians, virtual straight lines of the 2D projection of the ridge; (Right)
Result of function $f^{Gi}$ along the virtual straight lines between the points.



The manifold of all patterns for which equation (3.1) holds consists just of
the target patterns denoted by $X_1, X_2, X_3$. By accepting small deviations of
$f^{Gi}$ from value 1, which is controlled by $\zeta$ in equation (3.7), we enlarge the
manifold of patterns and thus make a generalization, as shown in Figure 3.4.

Fig. 3.4. (Left) Input space with three particular points, three Gaussians with
small extents; (Right) Result of function $f^{Gi}$ along the virtual straight lines
between the points, small set of points of the input space surpasses threshold $\zeta$.



The degree of generalization can be controlled by the extents of the Gaussians
via factor $\tau$ (see equation (3.10)). An increase of $\tau$ makes the Gaussians
flatter, with the consequence that a larger manifold of patterns is accepted
subject to the same threshold $\zeta$ (see Figure 3.5).
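The effect of the Gaussian extent on the accepted manifold can be illustrated numerically. The following is a sketch under stated assumptions: three hypothetical 2D centers, combination factors solved such that the network output is 1 at each center, and a plain tolerance on the deviation from 1 as a stand-in for the threshold of equation (3.7), which is not reproduced here. The parameter `extent` stands for the product of variance and overlap factor.

```python
import numpy as np

def gbf_matrix(points, centers, extent):
    """Spherical GBF responses of all points with respect to all centers."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / extent)

def accepted_fraction(centers, extent, tolerance, steps=101):
    """Fraction of points on the straight lines between the centers whose
    network output deviates from the desired value 1 by at most `tolerance`."""
    # Combination factors such that the network outputs 1 at every center.
    weights = np.linalg.solve(gbf_matrix(centers, centers, extent),
                              np.ones(len(centers)))
    t = np.linspace(0.0, 1.0, steps)[:, None]
    # Closed path X1 -> X2 -> X3 -> X1 made of three straight lines.
    path = np.vstack([(1 - t) * a + t * b
                      for a, b in zip(centers, np.roll(centers, -1, axis=0))])
    output = gbf_matrix(path, centers, extent) @ weights
    return float(np.mean(np.abs(output - 1.0) <= tolerance))

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
narrow = accepted_fraction(centers, extent=0.01, tolerance=0.2)
wide = accepted_fraction(centers, extent=1.0, tolerance=0.2)
```

For the small extent only points close to the three centers are accepted, whereas for the large extent the output stays near 1 along the whole path, so `wide` is clearly larger than `narrow`.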




Fig. 3.5. (Left) Input space with three particular points, three Gaussians with
large extents; (Right) Result of function $f^{Gi}$ along the virtual straight lines
between the points, large set of points of the input space surpasses threshold $\zeta$.



In various applications, to be presented in Section 3.3, we will treat the
overgeneralization/overfitting dilemma.

3.2.3 Canonical Frames with Principal Component Analysis

For an appearance-based approach to recognition the input space is typically
high-dimensional, as it consists of large-scale patterns. Due to the
high-dimensional input space, the training of recognition operators is
time-consuming. For example, the clustering step in learning an appropriate GBF
network is time-consuming, because frequent computations of similarity values
between high-dimensional training vectors are necessary. This subsection
introduces the use of canonical frames which will be combined with GBF
networks in Subsection 3.4.4. The purpose is to obtain reliable recognition
operators which can be learned and applied efficiently.

Role of Seed Images for Recognition 

The increased efficiency and reliability is based on a pre-processing step prior 
to learning and application. It makes use of the following ground truth. Im- 
ages taken under similar conditions, e.g. similar view angles, similar view 
distance, or similar illumination, contain diverse correlations between each 
other. In the sense of information theory, for similar imaging conditions the 
entropy of taking an image is low, and for a significant change of the imag- 
ing conditions the entropy is high. Accordingly, for the purpose of object or 
situation recognition it makes sense to take a small set of important images 
(we call them seed images, seed appearances, or seed patterns) and determine
a low-dimensional sub-space thereof. Any new image is supposed to be in
high correlation with one of the seed images, and therefore it is reasonable
to approximate it in the low-dimensional sub-space. The set of basis vectors
spanning this sub-space is called a canonical frame (CF), which is aligned to
the patterns resulting from the seed images of the object. From the
mathematical point of view, a canonical frame is a coordinate system; however, the
coordinate axes represent feature vectors. Coming back to the beginning of
this subsection, the pre-processing step consists in mapping new views into 
the canonical frame. 

Interplay of Implicit Function and Canonical Frame 

The implicit function in equation (3.1), which is responsible for approximat- 
ing a manifold of view patterns, will be represented in the canonical frame. We 
impose three requirements. First, in the canonical frame the implicit function 
should have a simpler description than in the original frame. For example, 
in the canonical frame a hyper-ellipsoid would be in normal form, i.e. the 
center of the ellipsoid is in the origin of the frame, and the principal axes are 
collinear with the frame axes. The dimension of parameter vector B is lower 
compared to the one in the original frame. Second, in the canonical frame 
the equation (3.1) must hold perfectly for all seed patterns which are repre- 
sented as vector Z, respectively. In this case, the parameters in vector B are 
invariant features of the set of all seed patterns. Third, the implicit function
should consider generalization principles as treated in the paradigms of
Machine Learning [177, pp. 349-363]. For example, according to the enlarge-set
rule and the close-interval rule, the implicit function must respond
continuously around the seed vectors and must respond nearly invariantly along
certain courses between successive seed vectors (in the pattern space). For avoiding
hazardous decisions, which are caused by over-generalizations, the degree of
generalization should be low.

An appropriate canonical frame together with an implicit function can be
constructed by principal component analysis (PCA). The usual application of PCA
relies on a large set of multi-dimensional, preferably Gaussian-distributed,
vectors. It computes a small set of basis vectors spanning a lower-dimensional
sub-space which includes rather accurate approximations of the original
vectors. However, we apply PCA to the small set of seed patterns and only move
the original frame into another one which contains the seed vectors exactly.
No effective dimension reduction is involved because all seed patterns are
equally significant. Taking the covariance matrix of the seed patterns into
account, we use the normalized eigenvectors as basis vectors of unit length. The
representation of a seed pattern in the canonical frame is by Karhunen-Loeve
expansion. The implicit function $f^{im}$ is defined as a hyper-ellipsoid in normal form
with the half-lengths of the ellipsoid axes defined dependent on the
eigenvalues of the covariance matrix, respectively. As a result, the seed vectors are
located on the orbit of this hyper-ellipsoid, and invariants are based on the
half-lengths of the ellipsoid axes.

Principal Component Analysis for Seed Patterns 

Let $\Omega_{PCA} := \{X_i \mid X_i \in \mathbb{R}^m;\ i = 1, \dots, I\}$ be the vectors representing the seed
patterns of an object. Taking the mean vector $X^c$ we compute the matrix

$$M := (X_1 - X^c, \dots, X_I - X^c) \qquad (3.16)$$

It is easily proven that the covariance matrix is obtained by

$$C := \frac{1}{I} \cdot M \cdot M^T \qquad (3.17)$$

We obtain the eigenvectors of the covariance matrix $C$. The $I$
vectors $X_1, \dots, X_I$ of $\Omega_{PCA}$ can be represented in a coordinate system which
is defined by just $(I - 1)$ eigenvectors and the origin of the system. A simple
example for $I = 2$ shows the principle. The difference vector $(X_1 - X_2)$ is
obtained as the first eigenvector $E_1$ and is used as axis of a one-dimensional
coordinate system. Additionally, the mean vector $X^c$ is taken as origin. Then,
the two vectors $X_1$ and $X_2$ are located on the axis with certain coordinates.
A second axis is not necessary, and in the one-dimensional coordinate system
each of the two vectors has just one coordinate, respectively. This principle
can be generalized to the case of $I$ vectors $X_1, \dots, X_I$ which are represented
in a coordinate system of just $(I - 1)$ axes, and just $(I - 1)$ coordinates are
needed for obtaining a unique location.

Consequently, the principal component analysis must yield at least one
eigenvalue equal to 0, and therefore the number of relevant eigenvalues
$\lambda_1, \dots, \lambda_{I-1}$ is at most $(I - 1)$. The vectors of $\Omega_{PCA}$ represent seed patterns
which are hardly correlated with each other. The degree of non-correlation
determines the actual number of relevant eigenvectors, which is less than or
equal to $(I - 1)$.

Let us assume the relevant eigenvectors $E_1, \dots, E_{I-1}$. The Karhunen-Loeve
expansion of a vector $X$, i.e. the projection of the vector into the
$(I - 1)$-dimensional eigenspace, is defined by

$$\hat{X} := (\hat{x}_1, \dots, \hat{x}_{I-1})^T := (E_1, \dots, E_{I-1})^T \cdot (X - X^c) \qquad (3.18)$$
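The construction of the canonical frame and the Karhunen-Loeve expansion can be sketched as follows. The function names `canonical_frame` and `kle` are hypothetical, and the seed vectors in the example are arbitrary illustrative values; the sketch assumes that the centered seed vectors span exactly $(I - 1)$ dimensions.

```python
import numpy as np

def canonical_frame(seeds):
    """PCA of I seed vectors (rows of `seeds`, each of dimension m).

    Returns the mean vector, the (I - 1) largest eigenvalues and the
    matching unit-length eigenvectors as columns of an (m, I - 1) matrix.
    """
    seeds = np.asarray(seeds, dtype=float)
    num_seeds = seeds.shape[0]
    mean = seeds.mean(axis=0)
    M = (seeds - mean).T                    # columns X_i - X^c, shape (m, I)
    C = (M @ M.T) / num_seeds               # covariance matrix, shape (m, m)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1][:num_seeds - 1]
    return mean, eigvals[order], eigvecs[:, order]

def kle(x, mean, eigvecs):
    """Karhunen-Loeve expansion: coordinates of x in the canonical frame."""
    return eigvecs.T @ (np.asarray(x, dtype=float) - mean)

# Three seed vectors in a four-dimensional pattern space.
seeds = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0, 0.0]])
mean, eigvals, E = canonical_frame(seeds)
coords = np.array([kle(s, mean, E) for s in seeds])   # shape (3, 2)
restored = mean + coords @ E.T                        # back to pattern space
```

Since the seed vectors lie exactly in the frame, two coordinates per seed vector suffice here to restore each of the three four-dimensional vectors without loss.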



Ellipsoidal Implicit Functions Based on PCA 



Based on the eigenvalues of the PCA and the Karhunen-Loeve expansion, we 
introduce the following implicit function which defines a hyper-ellipsoid. 



$$f^{im}(B, Z) := \left(\sum_{l=1}^{I-1} \frac{\hat{x}_l^2}{\kappa_l^2}\right) - 1 \qquad (3.19)$$



Notice the close relationship to the special case of a 2D ellipse formula in
equation (3.2). Input-output vector $Z := \hat{X} := (\hat{x}_1, \dots, \hat{x}_{I-1})^T$ is defined
according to equation (3.18). Parameter vector $B := (\kappa_1, \dots, \kappa_{I-1})^T$ contains
the parameters $\kappa_l$ denoting the half-lengths of the ellipsoid axes in normal
form. We define these parameters as

$$\kappa_l := \sqrt{(I - 1) \cdot \lambda_l} \qquad (3.20)$$



Let vectors $\hat{X}_1, \dots, \hat{X}_I$ be the KLE of the seed vectors $X_1, \dots, X_I$, as
defined in equation (3.18). For the special cases of assigning these
KLE-transformed seed vectors to $Z$, respectively, we made the following important
observation. By taking equation (3.19) into account, the equation (3.1)
holds perfectly for all seed vectors. That is, all seed vectors have a particular
location in the canonical frame, namely, they are located on the orbit of the
defined hyper-ellipsoid.⁴ The hyper-ellipsoid is an invariant description for
the set of seed vectors.
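The observation can be checked numerically. The sketch below assumes half-lengths $\kappa_l = \sqrt{(I - 1) \cdot \lambda_l}$ with $\lambda_l$ the eigenvalues of $C = \frac{1}{I} M M^T$, which is the scaling under which the seed vectors fall on the orbit; it further assumes that the centered seed vectors span exactly $(I - 1)$ dimensions. The function name and the use of a singular value decomposition instead of an explicit eigen-decomposition are choices made for this illustration.

```python
import numpy as np

def ellipsoid_implicit(seeds, x):
    """Value of the ellipsoidal implicit function for pattern x.

    The value is 0 on the orbit, negative inside and positive outside.
    """
    seeds = np.asarray(seeds, dtype=float)
    num_seeds = seeds.shape[0]
    mean = seeds.mean(axis=0)
    M = (seeds - mean).T
    # The left singular vectors of M are the eigenvectors of C, and the
    # eigenvalues of C are s**2 / I (SVD avoids forming the m x m matrix).
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    U, s = U[:, :num_seeds - 1], s[:num_seeds - 1]
    kappa2 = (num_seeds - 1) * s ** 2 / num_seeds   # squared half-lengths
    coords = U.T @ (np.asarray(x, dtype=float) - mean)
    return float(np.sum(coords ** 2 / kappa2) - 1.0)

seeds = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0, 0.0]])
values = [ellipsoid_implicit(seeds, s) for s in seeds]
center_value = ellipsoid_implicit(seeds, seeds.mean(axis=0))
```

All entries of `values` vanish up to rounding error, whereas the mean vector, being the center of the ellipsoid, yields the value -1.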

The PCA determines an $(I - 1)$-dimensional hyper-ellipsoid based on a
set of $I$ seed vectors, and all these are located on the orbit of the ellipsoid.
Generally, more than $I$ points are necessary for fitting a unique
$(I - 1)$-dimensional hyper-ellipsoid. The question of interest is which hyper-ellipsoid
will be determined by the PCA. It is well known that PCA determines the
first principal axis by maximizing the variance which is obtained by an
orthogonal projection of the sample points onto hypothetical axes.
Actually, this is the bias which makes the fitting unique.

For illustration, we describe the 2D case of constructing ellipses. It is
well known that 5 points are necessary for fitting a unique 2-dimensional
ellipse; however, only 3 points are available. Figure 3.6 shows two examples
of ellipses, each fitting the same set of three points. The ellipse in the left
image has been determined by PCA, and the ellipse in the right image has been
fitted manually. In the figure, the projection of the sample points on two
hypothetical axes is shown. The variance on the right is lower than on the
left, as expected, because the variance on the left is the maximum.

⁴ The proof is given in Appendix 1.




Fig. 3.6. Ellipses fitted through three points; (Left) Ellipse determined by PCA, 
showing first principal axis, determined by maximizing the variance; (Right) Ellipse 
determined manually with less variance along the first principal axis. 



Motivation for Ellipsoidal Implicit Functions 

A nice property can be observed concerning the aspect of generalization. The 
set of seed patterns is just a subset of the manifold of all patterns for which 
equation (3.1) holds. Actually, the size of this manifold, e.g. the perimeter of 
a 2D ellipse, correlates with the degree of generalization. PCA produces mod- 
erate generalizations by avoiding large ellipsoids. The ellipsoid determined by 
PCA can be regarded as a compact grouping of the seed vectors. This is also 
observed in Figure 3.6, which shows that the left ellipse (produced by PCA) 
is smaller than the right one. 

The ellipsoidal implicit function considers the enlarge-set and the
close-interval rule as requested by learning paradigms. In the following, this will
be demonstrated visually. Coming back to the problem of object recognition,
the question is which views other than the seed views can be recognized
via the implicit function. Let us assume three patterns from a target object
denoted by $X_1, X_2, X_3$, which are represented in the high-dimensional input
(pattern) space as points. In the left diagram of Figure 3.7 these points are
visualized as black disks (they have already been shown in Figure 3.6). We
determine the two-dimensional eigenspace and the eigenvalues by PCA, and
construct the relevant 2D ellipse through the points. The right diagram of
Figure 3.7 shows a constant value 1 when applying function $f^{Gi}$ (as defined
in equations (3.19) and (3.4)) to all orbit points of the ellipse. Therefore the
generalization comprises all patterns on the ellipse (close-interval rule).

Fig. 3.7. (Left) Input space with three particular points from which a 2D ellipse
is defined by PCA; (Right) Result of function $f^{Gi}$ along the ellipse, constant 1.



For a comparison with the GBF network approach, one can see the relevant
curve in the right diagram of Figure 3.3. There, a generalization did not
take place, i.e. equation (3.1) just holds for the Gaussian center vectors. In
the PCA approach the degree of generalization can be increased further
by considering the threshold $\zeta$ and accepting small deviations of $f^{Gi}$ from 1.
The relevant manifold of patterns is enlarged, as shown by the dotted band
around the ellipse in Figure 3.8 (enlarge-set rule).




Fig. 3.8. (Left) Input space with three particular points from which a 2D ellipse is
defined by PCA, small deviations from this ellipse are constrained by an inner and
an outer ellipse; (Right) Result of function $f^{Gi}$ along the ellipse, which is constant
1, accepted deviations are indicated by horizontal lines with offset $\mp\zeta$.



Discussion of GBF Approach and PCA Approach 

The following summary concludes this section. In the GBF approach, the
generalization is flexible, because the extent of each Gaussian can be controlled
individually. In comparison to this, the generalization in the PCA approach
is less flexible, because the manifold of patterns is enlarged globally by taking
patterns around the whole ellipse orbit into account. In the GBF approach,
it is expected that the sample data can be separated such that the
distribution of the elements in each cluster follows a normal probability density.
The PCA approach uses a small set of seed vectors (which are not normally
distributed) for constructing a canonical frame of the appearance patterns of
an object, and thus reduces the high-dimensional input space significantly.
The elliptical orbit obtained by PCA describes a global relationship between
the seed patterns.

In the next Section 3.3, we train GBF networks for the recognition of 
objects and the scoring of situations. Different configurations and parame- 
terizations of GBF networks affect the balance between invariance and dis- 
criminability of the recognition function. Finally, Section 3.4 combines the 
GBF and PCA approaches according to a strategy such that the advantages 
of each are exploited. The main purpose is to obtain a recognition function 
which obeys an acceptable compromise between invariance, discriminability, 
and efficiency. 



3.3 GBF Networks for Approximation of Recognition 
Functions 

In this section, GBF networks are applied for globally representing a manifold 
of different object appearances (due to changes of viewing conditions) or 
a manifold of different grasping situations (due to movements of a robot 
gripper). Just a small set of distinguished appearance patterns is taken
from which to approximate the relevant appearance manifold roughly.⁵ In
the spirit of applying Occam's razor to object or situation recognition, the
sparse effort for training and representation may be sufficient in the actual
application, i.e. reach a certain quality of PAC-recognition.⁶ The purpose
of visual demonstration and experimentation (with several configurations of
GBF networks) is to clarify this issue.



3.3.1 Approach of GBF Network Learning for Recognition 

Object recognition has to be grounded on features which discriminate between
the target object and other objects or situations and which can be extracted
from the image easily. In our approach, an object is recognized in a certain
image area by applying a learned recognition function to the signal structure
of this area. The output of the recognition function should be a real value
between 0 and 1, which encodes the confidence that the target object is
depicted in the image area. Regardless of the different appearance patterns
of the object, the recognition function should compute values near to 1. On
the other hand, the recognition function should compute values near to 0 for
image areas depicting any counter object or situation.

⁵ Approaches for 3D object recognition which use a small set of privileged or
canonical views are known as aspect graph methods [137].

⁶ Poggio and Girosi introduced the concept of sparsity in approximation techniques
[131, 64].

Alternatively, in certain applications, we need a scoring function for a 
more fine-grained evaluation of situations. This is typically the case in robotic 
servoing processes which deal with continually changing situations (see Chap- 
ter 4). For example, for grasping a target object appropriately, a certain 
function must be responsible for evaluating the grasping stability while ap- 
proaching the object [44]. Based on this, the manipulator is servoed to the 
most stable grasping pose in order to grasp the target object. 

Constructing GBF Networks for Recognition 

GBF networks are used for learning the recognition or scoring function. In 
case of object recognition, we must acquire distinguished samples of the ap- 
pearance manifold of the target object by changing the imaging conditions 
systematically and taking a discrete set of images. In case of situation evalua- 
tion, we must acquire distinguished samples of intermediate situations along 
with the desired scores. Optionally, we transform the relevant image patch 
with specific filters in order to enhance certain properties, or to simplify the 
complexity of the pattern manifold. According to the approach for learning
a GBF network, the generated set of training patterns must be clustered
with regard to similarity by taking the required quality of PAC-recognition
into account (e.g. using the error-based ISODATA algorithm mentioned in
Subsection 3.2.2). Depending on the actual requirements and on the specific
strategy of image acquisition, the clustering step can possibly be skipped if
each distinguished sample plays a significant role of its own, i.e. each
distinguished sample represents a one-element cluster. For each cluster a GBF is
defined, with the mean vector used as Gaussian center vector, and the covariance
matrix computed from the distribution of vectors in the cluster (for a
one-element cluster the identity matrix is used as covariance matrix). The
extents of the GBFs, defined by the combination of the covariance matrix $C$ and
a factor $\tau$, are responsible for the generalizing ability (see equation (3.11)),
i.e. the usefulness of the operator for new patterns (not included in the training
set). Factor $\tau$ for controlling the Gaussian overlap is the subject of the
experiments described later on. The final step of the learning procedure is to
determine appropriate combination factors of the GBFs by least squares fitting (e.g.
using the pseudo inverse technique mentioned in Subsection 3.2.2 or singular
value decomposition).

Appearance-Based Approach to Recognition 

The learned GBF network represents a recognition or scoring function. The 
application to a new pattern is as follows. The input nodes of the GBF net- 
work represent the input pattern of the recognition function. The hidden 
nodes are defined by $I$ basis functions, and all these are applied to the
input pattern. This hidden layer approximates the appearance manifold of the
target object or a course of situations. The output node computes the
recognition or scoring value by a weighted combination of results coming from the
basis functions. The input space of the GBF network is the set of all possible
patterns of the pre-defined size, but each hidden node responds significantly
only for a certain subset of these patterns. Unlike simple applications of GBF
networks, in this application of object recognition or situation scoring, the
dimension of the input space is extremely high (equal to the pattern size of the
target object, e.g. 15 x 15 = 225 pixels). The high-dimensional input space
is projected nonlinearly into a one-dimensional output space of confidence
values (for object recognition) or scoring values (for situation evaluation),
respectively.

Different Configurations/Parametrizations of GBF Networks 

By carefully spreading and parameterizing the Gaussian basis functions, an
optimal PAC operator can be learned, which carries out a compromise between
the invariance and discriminability criterion. The invariance criterion
strives for an operator which responds nearly equally for any appearance
pattern of the target object. The discriminability criterion aims at an operator
which clearly discriminates between the target object and any other object or
situation. This conflict is also known under the terms overgeneralization
versus overfitting. On account of applying the principle of minimum description
length to the configuration of GBF networks, it is desirable to discover the
minimum number of basis functions to reach a required quality of PAC function
approximation. In the experiments of the following subsections we show
the relationship between number and extents of the GBFs on the one hand
and the invariance/discriminability conflict on the other hand. The following
interesting question must be clarified by the experiments.



How many GBFs are needed and which Gaussian extents are appropriate
to reach a critical quality of PAC-recognition?



3.3.2 Object Recognition under Arbitrary View Angle 

For learning an appropriate operator, we must take sample images of the
target object under several view angles. We rotate the object by using a
rotary table and acquire orientation-dependent appearance patterns (size of
the object patterns 15 x 15 = 225 pixels). Figure 3.9 shows a subset of eight
patterns from an overall collection of 32. The collection is divided into a
training and a testing set comprising 16 patterns each. The training set has
been taken by equidistant turning angles of 22.5°, and the testing set
differs by an offset of 10°. Therefore, both in the training and testing
set the orientation of the object varies in discrete steps over the range of 360°.

Fig. 3.9. The target object is shown under eight equidistant rotation angles. The
patterns are used to learn an operator for object recognition under arbitrary view
angle.



The collection of GBFs and their combination factors are learned accord- 
ing to the approach of Subsection 3.3.1. By modifying the number and/or 
the extent of the GBFs, we obtain specific GBF network operators. 

Experiments on Object Recognition under Arbitrary View Angle

In the first experiment, a small extent has been chosen, which implies a
sparse overlap of the GBFs. By choosing 2, 4, 8, and 16 GBFs, respectively,
four variants of GBF networks are defined to recognize the target object.
Figure 3.10 shows the four accompanying curves (a), (b), (c), (d) of confidence
values which are computed by applying the GBF networks to the target
object of the test images. The more GBFs are used, the higher the confidence
values for recognizing the target. The confidence values vary significantly
when rotating the object, and hence the operators are hardly invariant.

The second experiment differs from the first in that a large extent of the 
GBFs has been used, which implies a broad overlap. Figure 3.11 shows four 
curves of confidence values, which are produced by the new operators. The 
invariance criterion improves and the confidence nearly takes the desired value 
1. Taking only the invariance aspect into account, the operator characterized 
by many GBFs and large extent is the best (curve (d)). 

The third experiment incorporates the discriminability criterion into object
recognition. An operator is discriminable if the recognition value computed
for the target object is significantly higher than those of other objects.
In the experiment, we apply the operators to the target object and to three
test objects (outlined in Figure 3.12 by white rectangles). Based on 16 GBFs
we systematically increase the extent in 6 steps. Figure 3.13 shows four curves
related to the target object and the three test objects. If we enlarge the extent
of the GBFs and apply the operators to the target object, then a slight
increase of the confidence values occurs (curve (a)). If we enlarge the extent in
the same way and apply the operators to the test objects, then the confidence
values increase dramatically (curves (b), (c), (d)). Consequently, the curves
for the test objects approach the curve for the target object. An increase of
the extent of the GBFs makes the operator more and more unreliable. However,
according to the previous experiment an increasing extent makes the
operator more and more invariant with regard to object orientation. Hence, a
compromise has to be made in specifying an operator for object recognition.

Fig. 3.10. Different GBF networks are tested for object recognition under arbitrary
view angle. The network output is a confidence value that a certain image patch
contains the object. The curves (a), (b), (c), (d) show the results under changing
view angle using networks of 2, 4, 8, 16 GBFs, respectively. The more GBFs, the
higher the confidence value. Due to a small GBF extent the operators are not
invariant under changing views.



Aspects of PAC Requirements in Recognition

For this purpose we formulate PAC requirements (see Definition 3.1 of
PAC-recognition). First, a probability threshold $P_r$ is pre-specified which serves
as a description of the required quality of the operator, e.g. $P_r := 0.9$.
Second, the maximum value for a threshold $\zeta_1$ is determined, such that with a
probability of at least $P_r$ the confidence value of a target pattern surpasses
$\zeta_1$. Third, the minimum value for a threshold $\zeta_2$ is determined, such that with
a probability of at least $P_r$ the confidence value of a counter pattern falls
below $\zeta_2$. Finally, if $\zeta_2$ is less than $\zeta_1$, then we define $\zeta$ as the mean
value between both. In this case, the recognition function is PAC-learned
subject to the parameters $P_r$ and $\zeta$. Otherwise, the recognition function
cannot be PAC-learned subject to parameter $P_r$. In this case, another
configuration and/or parameterization of a GBF network has to be determined.
Actually, the ideal GBF network would be the one which provides the highest
value for probability threshold $P_r$. A sophisticated approach is desirable which
has to optimize all unknowns of a GBF network in combination.⁷

^ We do not focus on this problem, instead refer to Orr [122]. 
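
The threshold selection can be sketched on a finite sample of confidence values. This is a minimal sketch under assumptions: probabilities are replaced by relative frequencies in the sample, the requirement counts as met when the counter threshold lies below the target threshold, and the function name `pac_threshold` and the sample values are hypothetical.

```python
import numpy as np

def pac_threshold(target_conf, counter_conf, p_required):
    """Return the decision threshold C, or None if the requirement fails.

    c1: largest sample value that at least a fraction p_required of the
        target confidences reach; c2: smallest sample value that at least
        a fraction p_required of the counter confidences do not exceed.
    """
    target = np.sort(np.asarray(target_conf, dtype=float))
    counter = np.sort(np.asarray(counter_conf, dtype=float))
    # Number of samples that must lie on the correct side (guard against
    # floating-point noise in the product before taking the ceiling).
    k_t = max(1, int(np.ceil(p_required * len(target) - 1e-9)))
    k_c = max(1, int(np.ceil(p_required * len(counter) - 1e-9)))
    c1 = target[len(target) - k_t]   # k_t target values are >= c1
    c2 = counter[k_c - 1]            # k_c counter values are <= c2
    if c2 < c1:
        return 0.5 * (c1 + c2)       # strictly between both thresholds
    return None

target = [0.95, 0.90, 0.85, 0.92, 0.88, 0.97, 0.60, 0.91, 0.93, 0.89]
counter = [0.10, 0.20, 0.15, 0.30, 0.25, 0.70, 0.05, 0.20, 0.35, 0.12]
print(pac_threshold(target, counter, 0.9))  # a value between 0.35 and 0.85
print(pac_threshold(target, counter, 1.0))  # None: the outliers overlap
```

Because the returned value lies strictly between the two sample thresholds, the required fraction of target values surpasses it and the required fraction of counter values falls below it.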

Fig. 3.11. Experiments similar to the one in Figure 3.10, but with a large
extent of the GBFs. The learned operators respond nearly invariantly under
varying view angles.







Fig. 3.12. The image shows a certain view of the target object (in a bold rectangle) 
and three test objects (in fine rectangles). The GBF network for object recognition 
should detect the target object in this set of four candidates. 

Fig. 3.13. Six GBF networks have been constructed, each with an equal number
of GBFs but with different GBF extents. Each GBF network has been applied to
the image patch of the target object and to the patches of the three test
objects. The GBF network computes a confidence value that the patch contains
the target object. The curves show the confidence values versus the extents of
the GBFs. The target object (curve (a)) can be discriminated from the test
objects (curves (b), (c), (d)) quite well by GBF networks of small extents.
However, for larger extents the discriminating power decreases.



It has to be mentioned that in our application of the PAC methodology
a learned operator for object recognition has been validated only for a
limited set of test examples, and of course not for the infinite set of all
possible situations. In effect, the probability threshold P'' is treated as a
frequency threshold. Consequently, there is no guarantee that the obtained
quality will also hold for other possible imaging conditions in this
application scenario. However, our approach of quality assessment is the best
practical way to proceed, i.e. to obtain a useful threshold P''. The designer
must provide reasonable training and test scenarios such that the quality
estimations of the learned operators will prove reliable.®

® Problems like these are treated with the approach of structural risk
minimization, which actually is a minimization of the sum of the empirical
risk and the so-called VC-confidence [31]. However, in our work we introduce
another approach to dealing with ambiguous situations, which is based on
gathering additional information from the images. The relationship between
objects and cameras can be changed advantageously, e.g. by continual feedback
control of the robot manipulator or head (see Subsection 4.3.6 later on), to
obtain more reliable and/or informative views.

3.3.3 Object Recognition for Arbitrary View Distance

In this subsection similar experiments are carried out for object recognition
under arbitrary view distance. In order to learn an appropriate operator, we
must take sample images of the target object under several spatial distances
between object and camera. Figure 3.14 shows on the left the image of a
scene with the target object and other objects taken under a typical
object-camera distance. On the right, a collection of 11 training patterns
depicts the target object, which has been taken under a systematic decrease of
the camera focal length in 11 steps. The effect is similar to decreasing the
object-camera distance. The size of the object pattern changes from 15 x 15
pixels to 65 x 65 pixels. We define a single GBF for each training pattern
(i.e. avoiding clustering), because each pattern encodes essential
information. The combination factors of the GBFs are determined as before. A
further collection of 10 test images has been acquired, which differs from the
training set by using intermediate values of the camera focal length.

Fig. 3.14. On the left, an image of a whole scene has been taken including the
target object. On the right, a collection of 11 images is taken just from the
target object under systematic increase of the inverse focal length. The
effect is similar to decreasing the object-camera distance. This collection of
images is used to learn an operator for object recognition under arbitrary
view distance.

Experiments on Object Recognition under Arbitrary View Distance

We constructed three operators for object recognition by taking small, middle,
and large extents of the GBFs (Figure 3.15). In the first experiment,
these operators have been applied to the target object of the test images.
Curve (a) shows the confidence values for recognizing the target object when
a small extent is used. The confidence value differs significantly

Fig. 3.15. Three GBF networks are tested, each with an equal number of GBFs
but differing by small, middle, and large GBF extent. Each network is applied
to the target object in 10 test images, which differ from each other in the
size of the depicted object, i.e. in the view distance. The network output
gives a confidence value that the image patch contains the target object. For
small or middle GBF extents (curves (a), (b)) the learned operators are hardly
invariant under changing view distance. For a large extent (curve (c))
invariance is reached.



when changing the object-camera distance and is far away from the desired
value 1. Alternatively, if we use a middle extent value, then the confidence
values approach 1 and the smoothness of the curve improves (curve (b)).
Finally, the use of a large extent value leads to approximately constant
recognition values close to 1 (curve (c)).

In the second experiment, we investigate the discriminability criterion for
the three operators from above. The operators are applied to all objects of
the test image (image on the left in Figure 3.14), and the highest confidence
value of recognition is selected. Of course, we expect to obtain the
highest recognition value from the target object. For comparison, Figure 3.16
depicts once again the confidence values of applying the three operators to
the target object (curves (a), (b), (c), equal to Figure 3.15).

If we apply the operator with large extent value to all objects of the test
images, then we frequently obtain higher confidence values for objects other
than the target object (see curve (c1)). In those cases, the operator fails
to localize the target object. Alternatively, the operator with middle extent
values fulfills the discriminability criterion better (curve (b1) surpasses curve

Fig. 3.16. The curves (a), (b), (c) of Figure 3.15 are shown, which are the
output of three GBF networks (differing by the extent) when applied just to
the patch of the target object under varying view distance. In order to assess
the reliability of these values for discriminating the target from other
objects, the three GBF networks have been applied further to the patches of
other objects under varying view distance. The left image of Figure 3.14 shows
all these objects under a certain view distance. Each GBF network computes an
output value for each object patch, and the maximum of these values is taken.
Repeating this procedure for all three GBF networks and for all view distances
yields the curves (a1), (b1), (c1). For a small GBF extent, the curves (a) and
(a1) are equal; for a middle extent, the curve (b1) sometimes surpasses curve
(b). For a large extent, the curve (c1) surpasses curve (c) quite often.
Generally, the larger the GBF extent, the less reliable the GBF network for
object recognition.



(b)) rarely. Finally, the operator with small extent values localizes the target
object in all test images. The highest confidence values are computed just for
the target object (curve (a) and curve (a1) are identical). Notice again the
invariance/discriminability conflict, which has to be resolved in the spirit of
the previous section.



3.3.4 Scoring of Grasping Situations 

So far, we have demonstrated the use of GBF networks for object recognition.
The approach is also well suited for the scoring of situations which describe
spatial relations between objects. By way of example, we illustrate specific
operators for evaluating grasping situations. A grasping situation is defined
to be most stable if the target object is located entirely between the
fingers. Figure 3.17 shows three images, each depicting a target object,
two bent grasping fingers, and some other objects. On the left and the
right, the grasping situation is unstable, because the horizontal part of the
two parallel fingers is behind and in front of the target object, respectively.
The grasping situation in the middle image is most stable. For learning to
recognize grasping stability, we moved the robot fingers step by step to the
most stable situation and then step by step away from it. The movement
is photographed in 25 discrete steps. Every second image is used for training
and the images in between for testing.




Fig. 3.17. Three typical images of grasping situations are shown. The left and 
the right grasping situations are unstable, the grasping situation in the middle 
is stable. Altogether, a sequence of 13 training images is used, which depict first 
the approaching of the gripper to the most stable grasping situation and then the 
departure from it. This image sequence is used to learn GBF networks for evaluating 
the stability of grasping situations. 



Using Filter Response Patterns instead of Appearance Patterns 

For learning operators, it would be possible to acquire large appearance
patterns containing not only the target object, but also certain parts of the
grasping fingers. However, the efficiency of recognition decreases if
large-sized patterns are used. A filter is needed for collecting signal
structure from a large environment into a small image patch. For this purpose,
Pauli et al. [127] proposed a product combination of two orthogonally directed
Gabor wavelet functions [138]. By applying such a filter to the left and the
middle image in Figure 3.17, and selecting the response of the (black)
outlined rectangular area, we obtain the overlay of two response patterns, as
shown in Figure 3.18.

A specific relation between grasping fingers and target object results in 
a specific filter response. Based on filter response patterns, a GBF network 
can be learned for scoring situations. The desired operator should compute 
a smooth parabolic curve of stability values for the course of 25 grasping sit- 
uations. For the experiment, we specified many operators by taking different 
numbers and/or extents of GBFs into account. Figure 3.19 shows the course 

Fig. 3.18. The product combination of two orthogonally directed Gabor wavelet
functions can be applied to the image patch of grasping situations. This
filter responds specifically to certain relations between target object and
grasping fingers. The overlay of the filter response patterns for two
different grasping situations is shown. According to this, we can represent
the finger-object relation by filter responses and avoid the difficult
extraction of symbolic features.



of stability values for two operators. The best approximation can be reached 
using a large number and large extent of GBFs (see curve (b)). 
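
The effect of the number and extent of GBFs on the stability course can be illustrated with a one-dimensional toy example. It is a sketch under loud assumptions: the position of a grasping situation in the sequence (0 to 24) stands in for the filter response pattern, the few-GBF network uses hand-picked centers instead of clusters, the combination factors are fitted by least squares, and all names and parameter values are hypothetical.

```python
import numpy as np

def gbf_matrix(steps, centers, extent):
    """Gaussian responses of every GBF (column) to every situation (row)."""
    return np.exp(-(steps[:, None] - centers[None, :]) ** 2 / (2.0 * extent ** 2))

def fit_and_score(train_steps, train_targets, centers, extent, all_steps):
    """Least-squares combination factors, then stability scores for all steps."""
    weights, *_ = np.linalg.lstsq(gbf_matrix(train_steps, centers, extent),
                                  train_targets, rcond=None)
    return gbf_matrix(all_steps, centers, extent) @ weights

all_steps = np.arange(25, dtype=float)            # 25 grasping situations
desired = 1.0 - ((all_steps - 12.0) / 12.0) ** 2  # parabola, maximum in the middle
train = all_steps[::2]                            # every second image: 13 samples
train_targets = desired[::2]

# Network (a): 3 GBFs with small extent; network (b): 13 GBFs with large extent.
few = fit_and_score(train, train_targets, np.array([0.0, 12.0, 24.0]), 1.0, all_steps)
many = fit_and_score(train, train_targets, train, 3.0, all_steps)

err_few = np.abs(few - desired).max()
err_many = np.abs(many - desired).max()
print(f"max error, 3 GBFs / small extent:  {err_few:.3f}")
print(f"max error, 13 GBFs / large extent: {err_many:.3f}")
```

The network with few, narrow GBFs leaves large gaps between its centers, while the network with one wide GBF per training situation follows the parabola also on the 12 test situations.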

In summary, different configurations and parameterizations of GBF networks
affect the balance between invariance and discriminability of the recognition
or scoring function. The main goal is to obtain a recognition function that
strikes an acceptable compromise and also takes efficiency into account. For
this purpose we combine the GBF and PCA approaches appropriately.



3.4 Sophisticated Manifold Approximation for Robust 
Recognition 

We introduce a coarse-to-fine strategy of learning object recognition, in which
a global, sparse approximation of a recognition function is fine-tuned on the
basis of space-time correlations and of critical counter situations. This will
be done by combining PCA with GBFs such that the advantages of both
approaches are exploited. Furthermore, the technique of log-polar
transformation will be applied for reducing the manifold complexity in order
to obtain an efficient recognition function.

Fig. 3.19. Based on a sequence of 13 training images (which contain the
approach to and the departure from the target object), two GBF networks have
been learned. They mainly differ by a small and a large number of GBFs,
respectively, i.e. from the 13 grasping situations a small and a large number
of clusters are constructed, respectively. This image sequence is used for
learning a parabolic curve of grasping stability whose maximum should be
reached for the middle image of the sequence. Then each GBF network is applied
to a succession of 25 different grasping situations depicting once again the
approach and departure. The images include both the 13 training situations and
12 test situations. With a network with a small number of GBFs, the resulting
course (a) of grasping stability is not the desired one. However, the course
(b) resulting from the network with a large number of GBFs is a good
approximation of the desired parabolic curve. It can be used for appropriately
evaluating grasping situations.



3.4.1 Making Manifold Approximation Tractable 

The purpose of recognition is to distinguish the pattern manifold of the target
object from the manifolds of counter objects or counter situations. The
invariance characteristic of a recognition function makes sense only if a
certain level of discriminability is included. This aspect must be considered
more directly in a strategy for acquiring a recognition function. The
complexity of the boundary between target and counter manifolds is directly
correlated with the complexity of the recognition function, i.e. its
description length. However, a recognition function with a large description
length is inefficient in application, as can be seen, for example, in the case
of GBF networks. The center patterns of all GBFs must be compared with the
input pattern, and therefore both the size of the patterns and the number of
GBFs affect the complexity of the recognition function. In the previous
Section 3.3, we acquired GBF networks exclusively for object recognition under
different view angles or different view distances. If both variabilities have
to be considered jointly, possibly also including variable illumination, then
the complexity of the pattern manifold increases significantly, and as a
consequence many more GBFs are needed to obtain an appropriate GBF network.
This also raises the issue of how many images to take under which conditions.
It is desirable to keep the effort of visual demonstration as low as possible
and to generalize appropriately from a small set of examples.



In summary, the recognition function should reach a certain level of 
robustness and both acquisition and application should be efficient. 



Object recognition and situation scoring are embedded in a camera- 
equipped robot system in which the agility of cameras can be exploited to 
execute a visual inspection task or a vision-supported robotic task. This as- 
pect is the driving force to work out concepts of solution for the conglomerate 
of problems presented above. 

Constraining the Possible Relationships 

Depending on the specific task and the specific corporeality of the robot, the 
possible relationships between objects and cameras are constrained. E.g., a 
camera fastened on a robot arm can move only in a restricted working space, 
and, therefore, the set of view angles and the set of view distances relative to 
a fixed object is restricted. The complexity of the pattern manifold decreases 
by considering in the training process only the relevant appearances. 

The relation between objects and cameras changes dynamically. E.g., 
a manipulation task is solved by continuous control based on visual infor- 
mation. The complexity of the recognition or scoring function decreases by 
putting/keeping the camera in a specific spatial relation to the trajectory of 
the manipulator, e.g. normal to the plane of a 2D trajectory. 

Simplification of Camera Movements 

In a robot-supported vision system the camera position and orientation
change dynamically for solving surveillance or inspection tasks. Different
manipulator and/or camera trajectories are conceivable. As an important
criterion, the trajectory should facilitate a simplification of the manifolds
or of their boundaries, which leads to efficient recognition or scoring
functions. Frequently, it is advantageous to simplify the camera movement
by decoupling rotation and translation and doing specialized movements
interval by interval. For example, first the camera can be rotated such that
the optical axis is directed normal to an object surface, and second the
camera can approach by translating along the normal direction. By log-polar
transformation the manifold of transformed patterns can be represented more
compactly, in which only shifts must be considered and any scaling or turning
appearances are circumvented.

In a robot-supported vision system the task of object recognition can
be simplified by first moving the camera into a certain relation to the object
and then applying the operator which has been learned just for the relevant
relation. For example, by visual servoing the camera can be moved towards
the object such that it appears with a certain extension in the image. The 
servoing procedure is based on simple, low-dimensional features like contour 
length or size of the object silhouette. The manifold for object recognition 
just represents variable object appearances which result from different view 
angles. However, appearance variations due to different view distances are 
excluded because of the prior camera movement. 

Exploiting Gray Value Correlations in Time 

Natural images have strong gray value correlations between neighboring
pixels. A small change of the relationship between object and camera yields
a small change of the image, and so the correlation between pixels also
reveals a correlation in time.® For the description of the manifold, or of
the boundaries between manifolds, which is the foundation for object
recognition or situation scoring, we can exploit these space-time
correlations. The use of space-time correlations will help to reduce the
number of training samples and consequently the effort of learning.

Extraordinary Role of Key Situations and Seed Views 

The task-solving process of a camera-equipped robot system is organized 
as a journey in which a series of intermediate goals must be reached. Ex- 
amples for such goals are intermediate effector positions in a manipulation 
task or intermediate camera positions in an inspection task. In the context 
of vision-controlled systems, the intermediate goals are key situations which 
are depicted in specific images (seed images, seed views). The seed views 
barely have gray value correlations because of large periods of time or large 
offsets of the points of view between taking the images. However, the seed 
views approximate the course of the task-solving process, and thus will serve 
as a framework in a servoing strategy. Later on, we will use seed views as a 
foundation for learning operators for object recognition and situation scoring. 

® In image sequence analysis this aspect is known as smoothness constraint and is 
exploited for determining the optical flow [80]. 

3.4.2 Log-Polar Transformation for Manifold Simplification

The acquisition and application of recognition functions can be simplified by
restricting the possible movements of a camera relative to a target object.
Let us assume for the moment that the possible movements comprise just a
rotation around or a translation along the optical axis, and that the object
is flat with the object plane normal to the optical axis. Then the rotation or
scaling of appearance patterns corresponds to a shifting of log-polar patterns
after log-polar transformation [23]. If the camera executes the restricted
movements, then a simple cross correlation technique would be useful to track
the object in the log-polar transformed (LPT) image. For this ideal case the
manifold of patterns has been reduced to just a single pattern.
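
Under the idealized assumption, the shift property follows directly from the polar representation. As a short sketch, let $\alpha$ denote a rotation angle about the optical axis and $s > 0$ a scale factor caused by translation along it (both symbols are introduced here only for illustration):

```latex
\begin{aligned}
(\rho,\ \theta) &\;\mapsto\; (s \cdot \rho,\ \theta + \alpha), \\
(\log \rho,\ \theta) &\;\mapsto\; (\log \rho + \log s,\ \theta + \alpha).
\end{aligned}
```

A rotation therefore shifts the pattern by $\alpha$ along the angular axis, a scaling shifts it by $\log s$ along the logarithmic radial axis, and the pattern itself is left unchanged.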

In realistic applications, presumably, the objects are of three-dimensional
shape, probably, the camera objectives cause unexpected distortions, and
possibly, the optical axis is not exactly normal to the object surface. Because
of these realistic imponderables, certain variations of the LPT patterns occur,
and the purpose of visual demonstration is to determine the actual manifold.
It is expected that the manifold of LPT patterns is much more compact and
easier to describe than the original manifold of appearance patterns, e.g. it
can be represented by a single GBF with normal density distribution.

Principles of Log-Polar Transformation 

The gray value image is separated into a foveal component around the
image center, which is a circle with radius $\rho^{min}$, and a peripheral
component around the fovea, which is a circular ring with maximum radius
$\rho^{max}$. The Cartesian coordinates $\{x_1, x_2\}$ of the image pixels are
transformed into polar coordinates $\{\rho, \theta\}$, taking the center of
the original image as origin. We define the log-polar transformation such that
only the peripheral component is considered (shown on the left in Figure 3.20
as a circular ring which is divided into sub-rings and sectors). Both the
fovea and the remaining components at the image corners are suppressed (shown
on the left in Figure 3.20 as gray-shaded areas). In the context of log-polar
transformation it is convenient to accept values for $\theta$ in the interval
$[-90^\circ, \ldots, +270^\circ]$. Parameter $\rho$ takes values in the
interval $[\rho^{min}, \ldots, \rho^{max}[$, with $\rho^{max}$ the radius of
the peripheral component. The Cartesian coordinate tuples
$\{(x_1, x_2)^T\}$ of the peripheral component form the domain of the
transformation.

Let $I^L$ be the discretized and quantized LPT image with $J_w$ columns and
$J_h$ rows and coordinate tuples $\{(v_1, v_2)^T\}$. For an appropriate
fitting into the relevant coordinate intervals of the LPT image we define the
following constants.

The proof for this theorem under the mentioned theoretical assumption is simple 
but will not be added, because we concentrate on realistic camera movements 
and pattern manifolds. 

Fig. 3.20. (Left) Cartesian coordinate system $\{X_1, X_2\}$ of image pixels,
partitioned into foveal circle, peripheral circular ring divided into
sub-rings and sectors, and remaining components at the image corners; (Right)
Log-polar coordinate system $\{V_1, V_2\}$ with the horizontal axis for the
angular component and the vertical axis for the radial component, mapping
cake-piece sectors from (the periphery of) the gray value image into
rectangular sectors of the LPT image.



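$$ h^{a} := \frac{J_w}{360^\circ} , \quad h^{b} := \log(\rho^{min}) , \quad h^{c} := \frac{J_h}{\log(\rho^{max}) - \log(\rho^{min})} \tag{3.21} $$

Definition 3.2 (Log-polar transformation of coordinates) The log-polar
transformation of Cartesian coordinates maps the coordinate tuples of the
peripheral component to $[0, \ldots, J_w - 1] \times [0, \ldots, J_h - 1]$,
and is defined by

$$ f^{v_1}(x_1, x_2) := round(h^{a} \cdot (\theta + 90^\circ)) , \quad f^{v_2}(x_1, x_2) := round(h^{c} \cdot (\log(\rho) - h^{b})) \tag{3.22} $$

$$ v_1 := f^{v_1}(x_1, x_2) , \quad v_2 := f^{v_2}(x_1, x_2) \tag{3.23} $$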
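
The coordinate transformation can be sketched for a single pixel as follows. The sketch is only an illustration under assumptions: the function name `log_polar_coords` and its parameters are ours, the scaling constants are chosen so that the angular interval fills the `j_w` columns and the logarithmic radius interval fills the `j_h` rows, and the clipping of the highest index is added so that rounding never leaves the valid range.

```python
import math

def log_polar_coords(x1, x2, center, rho_min, rho_max, j_w, j_h):
    """Map a Cartesian pixel to LPT indices (v1, v2), or None outside the ring."""
    dx, dy = x1 - center[0], x2 - center[1]
    rho = math.hypot(dx, dy)
    if not (rho_min <= rho < rho_max):
        return None                             # fovea or image corner: suppressed
    theta = math.degrees(math.atan2(dy, dx))    # in (-180, 180]
    if theta < -90.0:
        theta += 360.0                          # shift into [-90, 270)
    h_a = j_w / 360.0                           # columns per degree
    h_b = math.log(rho_min)                     # offset of the radial axis
    h_c = j_h / (math.log(rho_max) - h_b)       # rows per unit of log-radius
    v1 = min(round(h_a * (theta + 90.0)), j_w - 1)         # clipping: assumption
    v2 = min(round(h_c * (math.log(rho) - h_b)), j_h - 1)  # clipping: assumption
    return v1, v2

base = log_polar_coords(20, 0, (0, 0), 10, 100, 360, 100)
rotated = log_polar_coords(0, 20, (0, 0), 10, 100, 360, 100)   # turned by 90 degrees
scaled = log_polar_coords(40, 0, (0, 0), 10, 100, 360, 100)    # twice the radius
print(base, rotated, scaled)
```

In the usage lines, the rotated pixel differs from the base pixel only in `v1` and the scaled pixel only in `v2`, which is the shift behavior the transformation is meant to provide.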

Notice that, depending on the resolution of the original image $I^G$ and the
resolution of the LPT image $I^L$, the transformation defined by equations
(3.21), (3.22), and (3.23) may not be surjective, i.e. some log-polar pixels
remain undefined. However, an artificial over-sampling of the original image
would solve the problem. Furthermore, in the peripheral component presumably
several image pixels are transformed into just one log-polar pixel.
This aspect has to be considered in the definition of the log-polar
transformation of gray values by taking the mean gray value of the relevant
image pixels. For each log-polar pixel $(v_1, v_2)^T$ we determine the number
of image pixels $(x_1, x_2)^T$ which are mapped onto this log-polar pixel.

$$ f^{z}(v_1, v_2) := \sum_{cond} 1 \tag{3.24} $$

$$ cond := \big( (x_1, x_2)^T \text{ in the peripheral component} \big) \wedge \big( f^{v_1}(x_1, x_2) = v_1 \big) \wedge \big( f^{v_2}(x_1, x_2) = v_2 \big) \tag{3.25} $$

Definition 3.3 (Log-polar transformation of gray values) The log-polar
transformation of gray values is of functionality
$[0, \ldots, J_w - 1] \times [0, \ldots, J_h - 1] \to [0, \ldots, 255]$,
and is defined by

$$ I^{L}(v_1, v_2) := round\Big( \frac{1}{f^{z}(v_1, v_2)} \cdot \sum_{cond} I^{G}(x_1, x_2) \Big) \tag{3.26} $$
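
The gray value transformation can be sketched for a whole image with NumPy. Again this is a sketch under assumptions: the function name `log_polar_image`, the choice of the geometric image center as origin, the scaling constants, and the clipping of the highest index are ours; log-polar pixels that receive no image pixel are simply left at 0.

```python
import numpy as np

def log_polar_image(gray, rho_min, rho_max, j_w, j_h):
    """Log-polar transform of the peripheral ring of a gray value image.

    Returns the LPT image (j_h rows, j_w columns) and, per log-polar pixel,
    the number of image pixels that were averaged into it.
    """
    rows, cols = gray.shape
    r_idx, c_idx = np.mgrid[0:rows, 0:cols]
    dx = c_idx - (cols - 1) / 2.0
    dy = r_idx - (rows - 1) / 2.0
    rho = np.hypot(dx, dy)
    theta = np.degrees(np.arctan2(dy, dx))
    theta = np.where(theta < -90.0, theta + 360.0, theta)   # into [-90, 270)
    ring = (rho >= rho_min) & (rho < rho_max)               # peripheral component

    h_a = j_w / 360.0
    h_b = np.log(rho_min)
    h_c = j_h / (np.log(rho_max) - h_b)
    v1 = np.minimum(np.rint(h_a * (theta[ring] + 90.0)).astype(int), j_w - 1)
    v2 = np.minimum(np.rint(h_c * (np.log(rho[ring]) - h_b)).astype(int), j_h - 1)

    sums = np.zeros((j_h, j_w))
    counts = np.zeros((j_h, j_w), dtype=int)
    np.add.at(sums, (v2, v1), gray[ring].astype(float))  # accumulate gray values
    np.add.at(counts, (v2, v1), 1)                       # pixels per log-polar pixel

    lpt = np.zeros((j_h, j_w), dtype=np.uint8)
    hit = counts > 0
    lpt[hit] = np.rint(sums[hit] / counts[hit]).astype(np.uint8)  # mean gray value
    return lpt, counts

image = np.full((101, 101), 100, dtype=np.uint8)
lpt, counts = log_polar_image(image, 5.0, 50.0, 36, 16)
print(lpt.shape, int((counts == 0).sum()), "log-polar pixels undefined")
```

The returned count array makes the non-surjective case visible: entries equal to zero mark log-polar pixels onto which no image pixel was mapped.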

The left and right pictures in Figure 3.20 show the LPT principle, i.e. the
mapping of cake-piece sectors from (the periphery of) the gray value image
into rectangular sectors of the LPT image. Corresponding sectors are marked,
by way of example, with the symbols a, b, c. Based on the definitions of
log-polar transformation, we carry out real-world experiments using a relevant
object and an appropriate camera. The purpose of the realistic visual
demonstration is to
determine the actual variation of an LPT pattern if the object is rotating 
around the optical axis or the distance to the optical center of the camera 
is changing. Because of several imponderables in the imaging conditions and 
the inherent three-dimensionality of an object, we have to consider deviations 
from exact invariance. In the sense of the relevant discussion in Subsection 
1.4.1, we are interested in the degree of compatibility between invariants of 
object motion and changes of the LPT patterns of the view sequence. 

Experiments on Log-Polar Transformation

For this purpose an object has been put on a rotary table and the camera has
been arranged such that the optical axis goes through the center of rotation
with a direction normal to the rotary plane. However, this arrangement can
only be reached roughly. A rotation of an appearance pattern around the
image center should yield a translation of the respective LPT pattern along
the axis $V_1$. For illustration, Figure 3.21 shows two images of an
integrated circuit (IC) object under rotation, 90° apart. These are two
examples from a series of 24 discrete orientations spanning equidistantly the
interval [0°, ..., 360°]. Figure 3.22 shows the translation of the LPT pattern
in horizontal direction; however, a small deviation between the patterns
occurs.

A scaling of an appearance pattern with the scaling origin at the image
center should yield a translation of the LPT pattern along the axis $V_2$. In
reality the scaling is achieved by changing the view distance. Figure 3.23
shows again two images of the IC object with the same rotation angles as
in Figure 3.21, but with a shorter distance to the camera. Although the
object appears larger in the original images, the LPT patterns are of roughly
the same size as before, but are translated in vertical direction (compare
Figure 3.22 and Figure 3.24).

Fig. 3.21. Integrated circuit object under rotation by turning angle 90°; the
two images are taken under large viewing distance.




Fig. 3.22. Horizontal translation and small variation of the LPT pattern origi- 
nating from the rotating object in Figure 3.21. 



Variation of the LPT Patterns under Object Rotation and Scaling 

An approach is needed for describing the actual variation of the LPT pattern
more concretely. We present a simple technique which is based on histograms
of gray values or histograms of edge orientations. First, for the set of LPT
patterns the variation of the accumulations of the respective gray values is
determined. This gives a measurement of possible enlargements or shrinkages
of the LPT pattern. Second, for the set of LPT patterns the variation of the
accumulations of the respective edge orientations is determined. This gives
a measurement of possible rotations of the LPT pattern.

Our image library consists of 48 images which depict the IC object under 
rotation in 24 steps at two different distances to the camera (see Figure 3.21 
and Figure 3.23). The histograms should be determined from the relevant 
area of the LPT pattern, respectively. To simplify this sub-task a nearly 
homogeneous background has been used such that it is easy to extract the 
gray value structure of the IC object. 

In the first experiment, we compute a histogram of gray values for each
extracted LPT pattern of the image library. Figure 3.25 (left)

Fig. 3.23. Integrated circuit object under rotation by turning angle 90°; the
two images are taken under small viewing distance.




Fig. 3.24. Horizontal translation of the LPT pattern originating from the rotating
object in Figure 3.21, and vertical translation compared to Figure 3.22 which is due
to the changed viewing distance.



shows a histogram determined from an arbitrary image. The mean histogram 
is determined from the LPT patterns of the whole set of 48 images as shown 
in Figure 3.25 (right). Next we compute for each histogram the deviation 
vector from the mean histogram, which consists of deviations for each gray 
value, respectively. From the whole set of deviations once again a histogram 
is computed which is shown in Figure 3.26. This histogram resembles a Gaus- 
sian probability distribution with the maximum value at 0 and the Gaussian 
turning point approximately at the value ±10. Actually, the Gaussian extent 
describes the difference between reality and simulation. As opposed to this, if 
simulated patterns are used and a perfect simulation of the imaging condition 
is considered, then the resulting Gaussian distribution would have the extent 
0, i.e. the special case of an impulse. 
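
The evaluation can be sketched as follows. The sketch makes assumptions that are not taken from the experiments: the patterns are synthetic 8-bit arrays, the function name `deviation_statistics` and its parameters are hypothetical, and the standard deviation of the pooled deviations is used as an estimate of the Gaussian extent, since the turning (inflection) point of a Gaussian lies one standard deviation away from its maximum.

```python
import numpy as np

def deviation_statistics(patterns, levels=256, dev_bins=101, dev_range=(-50.0, 50.0)):
    """Pool the per-gray-value deviations of all histograms from the mean histogram."""
    # One gray value histogram per LPT pattern (rows), one bin per gray level.
    hists = np.stack([np.bincount(p.ravel(), minlength=levels) for p in patterns])
    mean_hist = hists.mean(axis=0)
    # Deviation vector of every histogram from the mean histogram, pooled.
    deviations = (hists - mean_hist).ravel()
    dev_hist, _ = np.histogram(deviations, bins=dev_bins, range=dev_range)
    extent = deviations.std()   # estimate of the Gaussian turning point
    return mean_hist, dev_hist, extent

rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
# 48 slightly disturbed copies stand in for the real image library.
noisy = [np.clip(base.astype(int) + rng.integers(-3, 4, size=base.shape),
                 0, 255).astype(np.uint8) for _ in range(48)]

_, _, extent_real = deviation_statistics(noisy)
_, _, extent_ideal = deviation_statistics([base] * 48)
print(f"extent with disturbances: {extent_real:.2f}")
print(f"extent for identical patterns: {extent_ideal:.2f}")
```

Identical patterns give an extent of exactly 0, which corresponds to the impulse of the perfectly simulated case, while any disturbance of the patterns widens the distribution of deviations.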

In the second experiment, we compute for the extracted LPT patterns 
of the image library a histogram of gradient angles of the gray value edges, 
respectively. Figure 3.27 (left) shows a histogram determined from an arbi- 
trary image. The mean histogram is determined from the LPT patterns of 
the whole set of 48 images as shown in Figure 3.27 (right). 

Fig. 3.25. (Left) Histogram of gray values computed from the relevant LPT
object pattern of an arbitrary LPT image in Figure 3.22 or Figure 3.24; (Right)
Mean histogram computed from all relevant LPT object patterns of the whole set
of 48 images.




Fig. 3.26. Accumulation of gray value deviations by comparing the 48 histograms
of gray values (taken from a set of 48 images) with the mean histogram shown
in Figure 3.25 (right).



Next, we compute for each histogram the deviation vector from the mean 
histogram. From the whole set of deviations once again a histogram is com- 
puted which is shown in Figure 3.28. This histogram can be approximated 
once again as a GBF with the maximum value at 0 and the Gaussian turning 
point approximately at the value ±5. In a simulated world the Gaussian 
distribution would have the extent 0. 

Fig. 3.27. (Left) Histogram of edge orientations computed from the relevant LPT 
object pattern of an arbitrary LPT image in Figure 3.22 or Figure 3.24; (Right) 
Mean histogram computed from all relevant LPT object patterns of a whole set of 
48 images. 

Fig. 3.28. Accumulation of orientation deviations by comparing the 48 histograms 
of edge orientation (taken from a set of 48 images) with the mean histogram shown 
in Figure 3.27 (right). 

Compatibility Properties under Log-Polar Representation 

These experiments show by example that in reality the log-polar patterns 
change slightly when the object is rotating or moving towards the camera. 
The theoretical concept of invariance must be relaxed by the practical concept 
of compatibility. Related to the Gaussian approximation of the histograms, 
the invariance is characterized by the special GBF with extent 0, but com- 
patibility is characterized by an extent beyond 0. 

The compatibility properties under log-polar representation are useful 
most of all for robot-supported vision systems with active cameras. For de- 
tailed object inspection a camera should approach in normal direction to the 
object plane, because with this strategy the tracking procedure of LPT pat- 
terns is simplified. In this case, the relevant manifold of LPT patterns, which 
is represented by the recognition function, simply consists of one pattern to- 
gether with slight deviations. Actually, the degree of compatibility determines 
the acceptance level of the recognition function. If the recognition function is 
represented by a GBF network (see Subsection 3.2.2) then one single GBF is 
supposed to be enough, whose center vector is specified by the typical LPT 
pattern and whose Gaussian extent is specified by the accepted deviations. 

Log-Polar Representation of Images of 3D Objects 

However, if objects are of three-dimensional shapes and/or are sequentially 
viewed under non-normalized conditions, then LPT will not reduce the com- 
plexity of pattern manifolds. Obviously, the compatibility properties un- 
der log-polar representation are invalid if the observed object is of three- 
dimensional shape, because the normal directions of certain faces differ sig- 
nificantly from the optical axis of the camera and certain object faces appear 
or disappear under different viewing angles. For tall objects both the top face 
and side faces come into play, but not all of them can be orthogonal to the 
direction of camera movement. However, a conceivable approach is to work 
only with the top face of an object. The boundary of the top face can 
perhaps be extracted with the approaches in Chapter 2. 

The nice properties of LPT are likewise invalid if the rotation axis and 
the optical axis of the camera differ significantly. It is the purpose of visual 
demonstration and experimentation to determine for certain camera move- 
ments the degrees of compatibility and find out a movement strategy such 
that a certain level of simplicity of the tracking process is reached. The mea- 
sured degree of compatibility can be used as criterion for arranging the camera 
prior to the application phase automatically. For example, in a servoing 
approach the camera can be rotated such that the optical axis is directed normal 
to the object plane (see Chapter 4). In the application phase the camera can 
move along the optical axis and a simple cross-correlation technique would 
be useful to track the object in the LPT images. This brief discussion drew 
attention to the following important aspect. 

For simplifying the appearance manifold, the log-polar transformation 
must be combined with procedures for boundary extraction and 
must also rely on servoing techniques for reaching and keeping 
appropriate camera-object relations. 



Depending on the specific task, performed in the relevant environment, 
the conditions for applying log-polar transformation may or may not be at- 
tainable. The following subsection assumes that LPT can not be applied. 
Instead, an approach for a sophisticated approximation of manifolds is pre- 
sented which exploits space-time correlations in an active vision application. 



3.4.3 Space-Time Correlations for Manifold Refinement 

The approach assumes that the pattern variation of the manifold can be 
represented as a one-dimensional course, approximately. In an active vision 
application, the assumption holds if a camera performs a one-dimensional 
trajectory around a stationary object, or the object moves in front of a sta- 
tionary camera. More complicated pattern variations, probably induced by 
simultaneous movements of both camera and object or additional changes of 
lighting conditions, are not accepted. Restricted to the mentioned assump- 
tion, we present a strategy for refined manifold approximation which is based 
on a GBF network with a specific category of hyper-ellipsoidal basis functions. 
The basis functions are stretched along one direction, which is determined 
on the basis of the one-dimensional space-time correlations. 

Variation of the Gray Value Patterns under Object Rotation 

As opposed to the previous subsection, in the following experiments we do 
not apply LPT. Instead, we make measurements in the original gray value 
images, but restrict our attention to the distribution of edge orientations. 
A three-dimensional transceiver box is put on a rotary table which is rotating 
in discrete steps with offset 5°, i.e. altogether 72 steps in the interval 
[0°, ..., 360°]. For each step of rotation a camera takes an image under a 
constant non-normal viewing direction of angle 45° relative to the table. 
Figure 3.29 shows four images from the whole collection of 72. The computation 
of gradient magnitudes followed by a thresholding procedure yields a set of 
gray value edges, as shown in the binary image, respectively. 

Related to each binary image we compute from the set of edges the discretized 
orientations, respectively, and determine the histogram of edge orientations. 
The discretization is in steps of 1° for the interval [0°, ..., 180°]. 
Figure 3.30 shows in the left diagram an overlay of four histograms from the 
four example images depicted previously, and shows in the right diagram the 
mean histogram computed from the whole collection of 72 images. 
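
The processing chain described above (gradient magnitudes, thresholding, discretized orientations, histogram) might be sketched as follows. This is an illustrative reading, not the original implementation: the function name, the use of simple finite differences via `np.gradient`, and the choice to take the gradient angle folded into [0°, 180°) as the edge orientation are all assumptions.

```python
import numpy as np

def edge_orientation_histogram(image, threshold):
    """Histogram of discretized edge orientations in steps of 1 degree.

    Returns (histogram with 180 bins, binary edge image).
    """
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)              # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)             # gradient magnitude
    edges = magnitude > threshold            # thresholding yields the binary edge image
    # Gradient angle folded into [0, 180), so opposite directions coincide.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.floor(angle[edges]).astype(int) % 180   # guard against a rounded 180.0
    return np.bincount(bins, minlength=180), edges

# Toy image with one vertical step edge (assumption, for illustration only).
image = np.zeros((20, 20))
image[:, 10:] = 1.0
histogram, edges = edge_orientation_histogram(image, threshold=0.25)
```

Averaging such histograms over all views of the rotating object gives the mean histogram against which the individual deviations are measured.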

Fig. 3.29. (Top) Four gray value images of a transceiver box under rotation in 
discrete steps of turning angle 5°; (Bottom) Binarized images of extracted gray 
value edges. 




Fig. 3.30. (Left) Overlay of four histograms of edge orientations computed for the 
four images in Figure 3.29; (Right) Mean histogram of edge orientations computed 
from a whole set of 72 images. 



The distribution of deviations from the mean histogram is shown in Fig- 
ure 3.31. The main difference when comparing this distribution with that in 
Figure 3.28 is that the maximum accumulation is not reached for the value 
0° of orientation deviation but is reached far away from it (approximately at 
the value −70°), i.e. the compatibility property under rotation does not hold. 

For general situations like these, the manifold of appearance patterns 
can be represented by a GBF network consisting of more than one GBF 
(see Subsection 3.2.2). We are interested in a sparse approximation of the 
recognition function which is characterized by a minimum number of GBFs. 

Fig. 3.31. Accumulation of orientation deviations by comparing the 72 histograms 
of edge orientation, taken from a set of 72 images, with the mean histogram shown 
in Figure 3.30 (right). 



In Section 3.3, a reduction of the number of GBFs has been reached with 
a clustering procedure which approximates each sub-set of similar patterns 
by just one typical pattern, respectively. However, we completely disregarded 
that the application of recognition functions takes place in a task-solving 
process, in which the relation between object and camera changes continually. 
The principle of animated vision should also be considered in the learning 
procedure, which will help to reduce the effort of clustering. In agreement 
with the work of Becker [16], one can take advantage of the temporal continuity 
in image sequences. 

Temporal Continuity between Consecutive Images 

The temporal continuity can be observed, for example, in a series of histograms 
of edge orientations for an object under rotation. Figure 3.30 (left) showed 
four histograms (a,b,c,d) for the object in Figure 3.29, which has been rotated 
slightly in four discrete steps of 5°, respectively. The histogram curves moved 
to the right continually under slight object rotation. 

A further example of temporal continuity is based on gray value correla- 
tions within and between images. For illustration, once again the transceiver 
box is used, but now images are taken in discrete steps of 1° in the orientation 
interval [0°, ..., 35°]. Figure 3.32 shows eight example images from the 
collection of 36 under equidistant angle offsets. The correlations can be observed 
easily by extracting the course of gray values at a single image pixel. For 
example, we selected two such pixels which have been marked in the images 
of Figure 3.32 by a black and a white dot, respectively. 

The variation of the accumulation values is due to changing lighting conditions 
or due to the appearing or disappearing of object faces. 




Fig. 3.32. Eight example images from the collection of 36 under equidistant 
turning angles, and overlay with a white and a black dot at certain pixel positions. 



The courses of gray values at these two points are shown in the diagrams 
of Figure 3.33. For certain intervals of time a piece-wise linear approximation 
of the gray value variation seems to be appropriate. This piece-wise linearity 
is an indication for reasonably assuming space-time correlation of gray values. 





Fig. 3.33. (Left) Course of gray value at position of white dot in Figure 3.32 for 
36 images of the rotating object; (Right) Relevant course at position of black dot. 

Generally, gray value correlations only hold for small spatial distances in 
the image or short time durations in a dynamic object/camera relationship. 
This aspect must be considered when exploiting the space-time correlations 
for the manifold approximation. We would like to approximate the manifold 
on the basis of a sparse set of training data, because this will reduce the 
effort of learning. Actually, this requirement of sparseness can be supported 
by making use of the space-time correlations. 

Considering Space-Time Correlations at Seed Views 

We determine space-time correlations for a small set of seed views and construct 
specific GBFs thereof, so-called hyper-ellipsoidal Gaussian basis functions. 
As already discussed at the beginning of this Subsection 3.4.3, we 
assume a one-dimensional course of the manifold approximation. Therefore, 
each GBF is almost hyper-spherical except for one direction whose GBF 
extent is stretched. The exceptional direction at the current seed view is 
determined on the basis of the difference vector between the previous and 
the next seed view. Actually, this approach incorporates the presumption of 
approximate, one-dimensional space-time correlations. 

For illustrating the principle, we take two-dimensional points which rep- 
resent the seed views. Figure 3.34 shows a series of three seed views, i.e. 
previous, current and next seed view. At the current seed view the construc- 
tion of an elongated GBF is depicted. Actually, an ellipse is shown which 
represents the contour related to a certain Gaussian altitude. 




Fig. 3.34. Principle of constructing hyper-ellipsoidal basis functions for time-series 
of seed vectors. 



The GBF extent along this exceptional direction must be defined such that 
the significant variations between successive seed views are considered. For 
orthogonal directions the GBF extents are only responsible for taking random 
imponderables into account such as lighting variations. Consequently, the 
GBF extent along the exceptional direction must be set larger than the extent 
along the orthogonal directions. It is reasonable to determine the exceptional 
GBF extent dependent on the Euclidean distance between the 
previous and the next seed view. 

Generally, the mentioned approach can be formalized as follows. Let 
$X_1^s, X_2^s, \ldots$ be the time series of seed vectors. A hyper-spherical 
Gaussian $f_i^{Gs}$ is defined with the center vector $X_i^s$, and extent 
parameter $\sigma_i$ is set equal to 1. Later on, it will become obvious that 
there is no loss of generality involved with the constant extent value. 

$f_i^{Gs}(X) := \exp\left(-\|X - X_i^s\|^2\right)$   (3.27) 

The Gaussian computes equal values for vectors $X$ located on an 
m-dimensional hyper-sphere around the center vector $X_i^s$. However, we are 
interested in an m-dimensional hyper-ellipsoid. Specifically, $(m - 1)$ ellipsoid 
axes should be of equal half-lengths, i.e. $\kappa_2 = \kappa_3 = \cdots = \kappa_m$, and one 
ellipsoid axis with half-length $\kappa_1$, which should be larger than the others. For 
this purpose the hyper-spherical Gaussian is modified as follows. We take in 
the time-series of seed vectors relative to the current vector $X_i^s$ the previous 
vector $X_{i-1}^s$ and the next vector $X_{i+1}^s$. Let us define two difference vectors 

$\Delta_i^1 := X - X_i^s, \quad \Delta_i^2 := X_{i+1}^s - X_{i-1}^s$   (3.28) 

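The angle $\phi_i$ between both difference vectors is computed as 

$\phi_i := \arccos\left(\frac{(\Delta_i^1)^T \Delta_i^2}{\|\Delta_i^1\| \cdot \|\Delta_i^2\|}\right)$   (3.29) 

The modifying expression for the Gaussian is defined as 

$f_i^{mo}(X) := (\kappa_1 \cdot \cos(\phi_i))^2 + (\kappa_2 \cdot \sin(\phi_i))^2$   (3.30) 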



Parameter $\kappa_1$ is defined on the basis of the Euclidean distance between vectors 
$X_{i-1}^s$ and $X_{i+1}^s$, e.g. as the half of this distance. Parameter $\kappa_2$ may be defined 
by a certain percentage $\alpha$ of $\kappa_1$, $\alpha \in [0, 1]$. 

$\kappa_1 := \frac{\|X_{i+1}^s - X_{i-1}^s\|}{2}, \quad \kappa_2 := \alpha \cdot \kappa_1$   (3.31) 



Finally, the modified Gaussian is as follows 

$f_i^{Gm}(X) := f_i^{Gs}(X) \cdot f_i^{mo}(X)$   (3.32) 



It can be proven that equation (3.32) is equal to a Gaussian with Mahalanobis 
distance between input vector $X$ and center vector $X_i^s$. The underlying 
hyper-ellipsoid is of the specific form as explained above. 

Although the approach is very simple, both efficiency and robustness of 
the recognition function increase significantly (see later on). 



Validation of Space-Time Correlations at Seed Views 

A realistic motivation of the principle can be presented for the series of 36 
images of the transceiver box, e.g. a subset of eight has already been shown 
in Figure 3.32. We define symbols $X_l^P \in \{X_0^P, \ldots, X_{35}^P\}$ to designate the 
images. For each image $X_l^P$, we take the two gray values at the positions marked 
by the white and black dot and combine them to a two-dimensional feature 
vector $X_l$, respectively, i.e. altogether 36 possible vectors. The two courses of 
gray values have already been shown in Figure 3.33. A subset of eight seed 
images $X_0^P, X_5^P, \ldots, X_{35}^P$ is selected with equidistant angle offset 5°. Actually 
these are the ones depicted in Figure 3.32. Based on the feature vector in each 
seed image, we obtain the seed vectors $X_1^s := X_0, X_2^s := X_5, \ldots, X_8^s := X_{35}$. 
By applying the approach summarized in equation (3.32), it is possible to 
construct an elongated Gaussian around the seed vectors $X_2^s, \ldots, X_7^s$, 
respectively. In Figure 3.35, each picture represents the two-dimensional feature 
space for the pair of gray values. The big dot marks the seed vector 
$X_i^s$ and the two dots of medium size mark the seed vectors $X_{i-1}^s$ and $X_{i+1}^s$, 
respectively. The small dots represent the feature vectors $X_l$ collected from 
a series of 11 images $X_l^P$, half by half taken prior and after the image which 
belongs to the seed vector $X_i^s$, i.e. these are the images $X_l^P \in \{X_k^P, \ldots, X_n^P\}$, 
with $k := (i - 2) \cdot 5$, $n := i \cdot 5$. We observe for nearly all pictures in Figure 3.35 
that the constructed ellipses approximate the distribution of relevant feature 
vectors quite well. In particular, the ellipsoid approximation is more appropriate 
than circles which would originate from radially symmetric Gaussians. 

Exploiting Space-Time Correlations for Object Recognition 

The usefulness of constructing elongated Gaussians can be illustrated for the 
task of object recognition. As opposed to the previous example in which a 
two-dimensional feature space of pairs of gray values has been used, in the 
following we will consider histograms of edge orientations. The left diagram 
in Figure 3.30 shows four example histograms which have been determined 
from images of the transceiver box under four successive rotation steps. The 
angles are discretized in integer values of the set {0, ..., 179} and therefore 
the feature vector consists of 180 components. 

We would like to construct a recognition function which makes use of the 
temporal continuity involved in object rotation. For the purpose of training, 
the transceiver box is rotated in steps of 10° and all 36 training images are 
used as seed images. The computation of gradient magnitudes followed by a 
thresholding procedure yields a set of gray value edges for each seed image, 
respectively. From each thresholded seed image a histogram of edge orien- 
tations can be computed. A GBF network is learned by defining elongated 
GBFs according to the approach presented above, i.e. using the histograms of 
the seed images as the Gaussian center vectors and modifying the Gaussians 
based on previous and next seed histograms. Parameter $\kappa_1$ in equation (3.30) 
is specified as half of the Euclidean distance between the previous and the next 
seed vector ($X_{i-1}^s$ and $X_{i+1}^s$), and $\kappa_2$ is specified by $0.33 \cdot \kappa_1$. In the GBF 
network the combination factors for the Gaussians are determined by the 
pseudo-inverse technique. 

Fig. 3.35. Six pictures showing the construction of elongated Gaussians for six 
seed vectors (originating from six seed images); each picture represents the 
two-dimensional feature space of a pair of gray values, and the black dots represent 
the gray value pairs belonging to the images prior to and after the seed image. 

For assessing the network of elongated Gaussians, we also construct a net- 
work of spherical Gaussians and compare the recognition results computed 
by the two GBF networks. The testing views are taken from the transceiver 
box but different from the training images. Actually, the testing data are sub- 
divided in two categories. The first category consists of histograms of edge 
orientations arising from images with a certain angle offset relative to the 
training images. Here, only the temporal continuity of object rotation comes 
into play, and the relevant recognition function has been trained particularly 
for these situations. The second category consists of histograms of edge 
orientations arising from images which have an angle offset and are 
additionally scaled. The recognition 
function composed of elongated Gaussians should recognize histograms of the 
first category robustly, and should discriminate clearly the histograms of the 
second category. The recognition function composed of spherical Gaussians 
should not be able to discriminate between both categories, which is due to 
an increased generalization effect, i.e. accepting not only the angle offsets but 
also scaling effects. 

The desired results are shown in the diagrams of Figure 3.36. By applying 
the recognition function of spherical Gaussians to all testing histograms, we 
can hardly discriminate between the two categories. Instead, by applying the 
recognition function of elongated Gaussians to all testing histograms, we can 
define a threshold for discriminating between both categories. 
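
The comparison between the two networks can be imitated on a toy problem. The sketch below is only an analogy under loudly stated assumptions: the 180-dimensional histograms are replaced by two-dimensional points on a circle (36 seed points in steps of 10°), the first testing category by points on the same circle with a 5° offset, and the second category by points on a circle enlarged by 10 percent. All function names are hypothetical. The combination factors are obtained with the pseudo-inverse, as described above, and a spherical network is obtained as the special case α = 1.

```python
import numpy as np

def gbf_matrix(points, seeds, alpha):
    """Responses of all basis functions (columns) to all points (rows).

    Each basis function is stretched along the difference vector between
    the previous and the next seed (cyclic order); alpha = 1 is spherical.
    """
    prev_seeds = np.roll(seeds, 1, axis=0)
    next_seeds = np.roll(seeds, -1, axis=0)
    delta2 = next_seeds - prev_seeds
    length = np.linalg.norm(delta2, axis=1)
    kappa1 = 0.5 * length
    kappa2 = alpha * kappa1
    unit = delta2 / length[:, None]
    delta1 = points[:, None, :] - seeds[None, :, :]      # (n_points, n_seeds, dim)
    along = np.einsum("pjd,jd->pj", delta1, unit)
    ortho_sq = np.maximum((delta1**2).sum(axis=2) - along**2, 0.0)
    return np.exp(-((along / kappa1) ** 2 + ortho_sq / kappa2**2))

def train(seeds, alpha):
    """Combination factors such that every seed yields the value 1."""
    return np.linalg.pinv(gbf_matrix(seeds, seeds, alpha)) @ np.ones(len(seeds))

def respond(points, seeds, weights, alpha):
    return gbf_matrix(points, seeds, alpha) @ weights

def orbit(radius, angles_deg):
    a = np.radians(angles_deg)
    return np.column_stack((radius * np.cos(a), radius * np.sin(a)))

seeds = orbit(10.0, np.arange(0, 360, 10))        # 36 seed points
test_a = orbit(10.0, np.arange(5, 360, 10))       # angle offset only
test_b = orbit(11.0, np.arange(5, 360, 10))       # angle offset plus scaling

results = {}
for name, alpha in (("elongated", 0.33), ("spherical", 1.0)):
    weights = train(seeds, alpha)
    results[name] = (respond(test_a, seeds, weights, alpha),
                     respond(test_b, seeds, weights, alpha))
```

In this toy setting both networks respond with values close to 1 to the first category, but only the elongated network drops to values close to 0 for the second category, so a threshold between the categories is easy to place.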

Discussion of the Learning Strategy 

Based on the results of these experiments, we draw the following conclusion 
concerning the strategy of learning. For a continually changing relation be- 
tween object and camera we have observed space-time correlations of gray 
values. That is, the orbit of patterns is piece-wise continuous and can be 
approximated piece-wise linearly. In consequence of this, only a reduced set of 
seed views needs to be taken into account for learning a recognition function. 
From the seed views we not only use the individual image contents but also 
the gray value relations between successive seed views. 



In the process of learning operators for object recognition the tem- 
poral continuity of training views can be exploited for the purpose of 
reducing the number of views. 



This motivates the use of a sparse set of elongated Gaussians for constructing 
a fine-tuned approximation of recognition functions. The approach overcomes 
the need of a large set of training views and avoids the time-consuming 
clustering process. In our previous experiments, we took the training views 
in discrete steps by changing a certain degree-of-freedom, e.g. constant angle 
offset of a rotating object, and defined this set as seed views. Strategies are 
needed for determining the appropriate offset automatically. Alternatively, 
more sophisticated approaches are conceivable for choosing a relevant set of 
seed views, as discussed in Section 3.5. 

A simple post-processing strategy is conceivable for reducing the number of 
GBFs. According to this, neighboring seed vectors $X_i^s$ and $X_{i+1}^s$ can be collected 
if the respective difference vectors $\Delta_i^2$ and $\Delta_{i+1}^2$ are approximately collinear. 

Fig. 3.36. Confidence values of recognizing an object based on histograms of edge 
orientations. For testing, the object has been rotated by an offset angle relative to 
the training images (result given in curve a), or the object has been rotated and 
the image has been scaled additionally relative to the training images (result given 
in curve b). (Left) Curves show the courses under the use of spherical Gaussians, 
both categories of testing data can hardly be distinguished; (Right) Curves show 
the courses under the use of elongated Gaussians, both categories of testing data 
can be distinguished clearly. 

3.4.4 Learning Strategy with PCA/GBF Mixtures 

In Subsection 3.2.1, the learning of a recognition function has been embed- 
ded into the problem of estimating the parameters of an implicit function. 
The set of example vectors which fulfills the function equation approximates 
the pattern manifold of the relevant object. For realistic applications a cer- 
tain degree of inequality must be accepted, i.e. pure invariance is relaxed to 
compatibility. In Subsection 3.2.2, it was suggested to represent the implicit 
function by a network of Gaussian basis functions, because universal approx- 
imations can be reached by varying the number of GBFs. For fine-tuning 
the GBFs one should exploit temporal continuity in acquiring the training 
views, as was motivated in Subsection 3.4.3. In Subsection 3.2.3, the implicit 
function has been defined alternatively as a hyper-ellipsoid, whose principal 
axes and half-lengths are determined by principal component analysis of a 
small set of seed views. The number of principal axes is equal to the number 
of seed views and this number is much less than the size of an object pattern. 
In consequence of this, the principal axes specify a canonical frame whose 
dimension is much less than the dimension of the input space. 

The Role of Counter Situations in Learning 

Both the GBF and the PCA approach to learning do not consider counter 
situations directly for acquiring the recognition functions. Instead, for a set 
of typical patterns of an object, a function is learned which responds almost 
invariantly. The degree of compatibility is used as threshold for discriminating 
counter situations. In certain robot tasks this learning strategy is the only 
conceivable one, e.g. if a training set of views is available just for one specific 
object which should be detected in any arbitrary environment. However, in 
many robot applications it is realistic that certain environments or certain 
situations occur more frequently than others. It is important to consider counter 
situations from typical environments for fine-tuning a recognition function 
and thus increasing the robustness of recognition. 

Survey of the Coarse-to-Fine Strategy of Learning 

We present a coarse-to-fine strategy of learning a recognition function, which 
approximates the manifold of object patterns coarsely from a sparse set of 
seed views and fine-tunes the manifold with more specific object patterns or 
counter situations, so-called validation views. For the coarse approximation 
of the manifold either the GBF or the PCA approach is suitable. For 
fine-tuning the manifold we use once again a GBF network. Both the coarse and 
the fine approximation are controlled under certain PAC requirements (see 
Definition 3.1). The function for object recognition should be PAC-learned 
subject to the probability $P_r$ and threshold $\zeta$. It is reasonable to choose for 
parameter $P_r$ a positive real value near the maximum 1 (value 1 means 100 
percent) and also for parameter $\zeta$ a positive value near 1. 

Let $f^{im}$ be an implicit function according to equation (3.5), with parameter 
vector $B$ and input-output vector $Z$. The vector $Z$ is composed of the 
input component $X$, representing a pattern, and the output component $Y$, 
representing a class label or scoring value. In this subsection we focus on the 
classification problem. We will have several functions $f_k^{im}$, each responsible for 
a certain class with label $k$. In the following, label $k$ is suppressed to avoid 
the overload of indices. For convenience, we also suppress label $k$ in vector $Z$ 
and instead accept vector $X$ in the application of function $f^{im}(B, X)$. 

Coarse Approximation Based on Seed Patterns 

The first step, i.e. coarse approximation, is based on an ensemble of seed 
patterns $X_i^s \in \mathcal{X}^s$. It is assumed that function $f^{im}$ has been PAC-learned from 
the seed views subject to parameters $P_r$ and $\zeta$ with either the GBF or the 
PCA approach. The PAC requirement holds trivially with $P_r = 1$ and $\zeta = 1$, 
because both approaches are configured such that all seed patterns are located 
on the orbit, respectively. Related to Definition 3.1 of PAC-recognition, in this 
first step the target patterns are the seed patterns and counter patterns are 
not considered. 

Fine Approximation Based on Validation Patterns 

In the second step, i.e. fine approximation, we take an ensemble of validation 
patterns $X_j^v$ into account which is subdivided into two classes. The first class 
$\mathcal{X}^{vp}$ of validation patterns is extracted from additional views of the target 
object and the second class $\mathcal{X}^{vn}$ of patterns is extracted from views taken 
from counter situations. Depending on certain results of applying the implicit 
function $f^{im}$ (computed in the first step) to these validation patterns we 
specify spherical Gaussians according to equation (3.10) and combine them 
appropriately with the definition of the implicit function. The purpose is to 
modify the original implicit function and thus fine-tune the approximation of 
the pattern orbit of the target object, i.e. target patterns should be included 
and counter patterns excluded. 

For each validation pattern $X_j^v \in \mathcal{X}^{vp} \cup \mathcal{X}^{vn}$ we apply function $f^{im}$ which 
yields a measurement of proximity to the coarsely learned pattern orbit. 

$\eta_j := f^{im}(B, X_j^v)$   (3.33) 

For $\eta_j = 0$ the pattern $X_j^v$ is far away from the orbit, for $\eta_j = 1$ the pattern 
belongs to the orbit. There are two cases for which it is reasonable to modify 
the implicit function. First, a pattern of the target object may be too far 
away from the orbit, i.e. $X_j^v \in \mathcal{X}^{vp}$ and $\eta_j < \zeta$. Second, a pattern of a 
counter situation may be too close to the orbit, i.e. $X_j^v \in \mathcal{X}^{vn}$ and $\eta_j > \zeta$. Pattern 
$X_j^v$ is the triggering element for fine-tuning the coarse approximation. In the 
first case, the modified function should yield value 1 at pattern $X_j^v$, and in 
the second case, should yield value 0. Additionally, we would like to reach 
generalization effects in the local neighborhood of this pattern. A prerequisite 
is to have continuous function values in this local neighborhood including the 
triggering pattern $X_j^v$. 

The modification of the implicit function takes place by locally putting a 
spherical Gaussian into the space of patterns, then multiplying a weight- 
ing factor to the Gaussian, and finally adding the weighted Gaussian to the 
implicit function. The mentioned requirements can be reached with the 
following parameterizations. The center vector of the Gaussian is defined by the 
relevant pattern $X_j^v$. 

$f_j^{Gs}(X) := \exp\left(-\frac{1}{\tau} \cdot \|X - X_j^v\|^2\right)$   (3.34) 

For the two cases, we define the weighting factor $w_j$ for the Gaussian 
individually. It will depend on the computed measurement of proximity $\eta_j$ of the 
pattern $X_j^v$ to the coarsely learned pattern orbit. 

$w_j := \begin{cases} 1 - \eta_j & \text{: target pattern too far away from orbit} \\ -\eta_j & \text{: counter pattern too close to orbit} \end{cases}$   (3.35) 



The additive combination between the implicit function and the weighted 
Gaussian yields a new function for which the orbit of patterns has been 
changed. By considering the above definitions in equations (3.33), (3.34), and 
(3.35), the modified function will meet the requirements. In particular, the 
desired results are obtained for the triggering pattern, i.e. $X = X_j^v$. In the 
first case, we assume a weighted Gaussian has been constructed for a vector 
$X_j^v \in \mathcal{X}^{vp}$. By applying the vector to the modified function the value 1 is 
obtained, i.e. the pattern belongs to the orbit. In the second case, we assume 
that a weighted Gaussian has been constructed for a vector $X_j^v \in \mathcal{X}^{vn}$. By 
applying the vector to the modified function the value 0 is obtained, i.e. the 
pattern is far away from the orbit. In both cases, the Gaussian value is 1, 
and the specific weight plays the role of an increment or decrement, 
respectively, to obtain the final outcome 1 or 0. 

In the neighborhood (in the pattern space) of the triggering pattern the 
modified function produces a smoothing effect by the inherent extension of 
the Gaussian. It is controlled via factor $\tau$ in equation (3.34). The higher 
this factor the larger the local neighborhood of $X_j^v$ which is considered in 
modifying the manifold. The size of this neighborhood directly corresponds 
with the degree of generalization. 
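
The fine-tuning step for a single triggering pattern can be sketched as follows. This is a minimal illustration under assumptions, not the original implementation: the coarse proximity function `coarse` is a hypothetical stand-in (a single Gaussian around the origin that returns 1 on the "orbit"), and the names `refine`, `zeta`, and `tau` are chosen for the example.

```python
import numpy as np

def coarse(x, center=np.zeros(2), extent=1.0):
    """Hypothetical coarse proximity function: 1 on the orbit, toward 0 far away."""
    return float(np.exp(-np.sum((np.asarray(x, dtype=float) - center) ** 2) / extent))

def refine(coarse_fn, x_v, is_target, zeta, tau):
    """Add one weighted spherical Gaussian if x_v is a triggering pattern."""
    eta = coarse_fn(x_v)                      # proximity to the coarse orbit
    if is_target and eta < zeta:
        weight = 1.0 - eta                    # target pattern too far away from orbit
    elif not is_target and eta > zeta:
        weight = -eta                         # counter pattern too close to orbit
    else:
        return coarse_fn                      # no modification necessary
    x_v = np.asarray(x_v, dtype=float)

    def modified(x):
        diff = np.asarray(x, dtype=float) - x_v
        gaussian = np.exp(-np.sum(diff**2) / tau)   # equals 1 at the triggering pattern
        return coarse_fn(x) + weight * gaussian

    return modified

# A target view that the coarse function misses, and a counter view it accepts.
include_target = refine(coarse, [2.0, 0.0], is_target=True, zeta=0.9, tau=0.5)
exclude_counter = refine(coarse, [0.2, 0.0], is_target=False, zeta=0.9, tau=0.5)
```

At the triggering pattern itself the modified function returns 1 in the first case and 0 in the second case, while patterns far from the triggering pattern are practically unaffected. If several weighted Gaussians overlap, these values hold only approximately.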



Constructing the Recognition Function 

With this foundation we show how to construct recognition functions. Gener- 
ally, the ensemble of validation patterns contains several triggering patterns. 
We determine the set of triggering patterns and define for each one a Gaussian, 
respectively. Let $j \in \{1, \ldots, J\}$ be the indices of the set of necessary 
Gaussians as explained above. The overall recognition function $f^{rc}$ can be defined 
as the sum of the transformed implicit function and the linear combination 
of necessary Gaussians. 

$f^{rc}(X) := f^{im}(B, X) + \sum_{j=1}^{J} w_j \cdot f_j^{Gs}(X)$   (3.36) 

Vector X represents an unknown pattern which should be recognized. The
parameter vector B has been determined during the phase of coarse
approximation. The number and the centers of the Gaussians are determined
during the phase of fine approximation. Only one degree of freedom remains to
be determined, i.e. the factor $\tau$ specifying the extent of the Gaussians.
Iterative approaches such as the Levenberg-Marquardt algorithm can be applied
to solve for it [134, pp. 683-688].
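
The construction in equation (3.36) can be sketched in a few lines. The
following is a minimal illustration and not the author's implementation: the
coarse function is passed in as a callable, the Gaussians are assumed to be
spherical with the form exp(-||X - X_j||^2 / tau), and the weight rule
(desired outcome minus the coarse value at the center) is inferred from the
description of the triggering patterns rather than taken from equation
(3.35). The names `make_recognition_function`, `circle_proximity`, and `TAU`
are hypothetical.

```python
import numpy as np


def make_recognition_function(coarse, centers, targets, tau):
    """Build f_rc(X) = coarse(X) + sum_j w_j * exp(-||X - X_j||^2 / tau)."""
    centers = np.asarray(centers, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Assumed weight rule: desired outcome minus the coarse value at the center,
    # so an isolated Gaussian lifts or lowers the function to exactly 1 or 0.
    weights = targets - np.array([coarse(c) for c in centers])

    def f_rc(x):
        x = np.asarray(x, dtype=float)
        sq_dist = np.sum((centers - x) ** 2, axis=1)
        return float(coarse(x) + weights @ np.exp(-sq_dist / tau))

    return f_rc


TAU = 0.05  # hypothetical extent, shared by the coarse function and the Gaussians


def circle_proximity(x):
    """Toy coarse approximation: proximity to the unit circle in 2D."""
    return float(np.exp(-(np.linalg.norm(x) - 1.0) ** 2 / TAU))


# One counter pattern on the orbit (target 0), one target pattern far off it (target 1).
f_rc = make_recognition_function(
    circle_proximity, centers=[[1.0, 0.0], [0.0, 3.0]], targets=[0, 1], tau=TAU
)
values = [f_rc([1.0, 0.0]), f_rc([0.0, 3.0]), f_rc([-1.0, 0.0])]
```

In this toy setting the three values are approximately 0, 1, and 1: the
counter pattern is suppressed, the remote target pattern is lifted, and an
unaffected point on the orbit keeps its coarse value. When several Gaussians
overlap noticeably, this simple weight rule is no longer exact.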

Criterion for the Recognition of Unknown Patterns

Based on a coarse approximation of the pattern manifold of the target object
(using seed views), the approach takes a further set of patterns into
account, arising from the target object and from counter situations (using
validation views), in order to fine-tune the approximation and make the
recognition function more robust. This coarse-to-fine strategy of learning
can be applied to any target object which we would like to recognize. If
$k \in \{1, \dots, K\}$ is the index for a set of target objects, then
recognition functions $f_k^{rc}$ can be learned as above. For a robust
discrimination between these target objects, it is reasonable to learn the
recognition function of each target object by treating the other target
objects as counter situations. The final decision for classifying an unknown
pattern X is made by looking for the maximum result computed from the set of
recognition functions $f_k^{rc}$, $k \in \{1, \dots, K\}$.

$$k^* := \arg\max_{k \in \{1, \dots, K\}} f_k^{rc}(X) \qquad (3.37)$$
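
The decision rule of equation (3.37) amounts to an arg-max over the K
recognition values. The sketch below uses toy stand-ins for the learned
recognition functions (Gaussian bumps around hypothetical prototypes); only
the `classify` function reflects the rule itself. Note that the returned
index is zero-based, whereas k in the text runs from 1 to K.

```python
import numpy as np


def classify(pattern, recognition_functions):
    """Return (index of the best-matching object, all K recognition values)."""
    values = np.array([f(pattern) for f in recognition_functions])
    return int(np.argmax(values)), values


# Toy stand-ins for learned recognition functions (hypothetical prototypes).
prototypes = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
functions = [
    lambda x, p=p: float(np.exp(-np.sum((np.asarray(x, dtype=float) - p) ** 2)))
    for p in prototypes
]

k_star, values = classify([1.8, 0.1], functions)
```

Here the pattern lies closest to the second prototype, so `k_star` is 1.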



Aspects of PAC Requirements on the Recognition Function

All recognition functions must be PAC-learned subject to the parameters
$Pr$ and $\zeta$. The PAC requirement must be checked for the whole set of
training data, which consists of the seed patterns and the additional
validation patterns of the target object, and the patterns of counter
situations. If this requirement does not hold for certain recognition
functions, it is necessary to increase the number of seed vectors and thus
the dimension of the pattern space in which the fine approximation takes
place. Consequently, each recognition function approximates a manifold of
patterns whose dimension depends on how different or similar the target
object is compared to other target objects in the task-relevant environment.
The recognition function for an object that is easy to discriminate from
other objects is defined in a low-dimensional space, whereas in case of hard
discrimination the function is defined in a high-dimensional space. This
strategy of learning recognition functions agrees with the design principle
of purposiveness (see Section 1.2), i.e. subject to the requirements of the
task, the recognition function is constructed with minimum description length.

Visualization of the Approach Applied to Pattern Recognition 

The coarse-to-fine strategy of learning can be illustrated graphically.
Similar to Section 3.2, we assume three seed patterns
$X_1^s, X_2^s, X_3^s$, which are treated as points in the high-dimensional
input space. By using principal component analysis for coarse approximation,
we obtain an ellipse through the points. The exponential transformation of
the implicit function yields values of proximity to the ellipse between 0
and 1. In fact, the proximity values along a straight line that crosses the
orbit perpendicularly follow a Gaussian (see Figure 3.37). Let us assume that
the extent of this Gaussian is equal to the extent of the Gaussians which are
taken into account for fine-tuning the coarse approximation.




Fig. 3.37. (Left) Ellipse through three seed vectors and perpendicular straight line 
across the ellipse; (Middle) Gaussian course of proximity values along the straight 
line, (Right) Constant course of proximity values along the ellipse. 



As a first example, it may happen that a pattern of a counter situation,
i.e. $X_1^v \in \mathcal{X}^{vn}$, is located on the ellipse (see Figure 3.38
(left)). A Gaussian is defined with $X_1^v$ as center vector and weight
$w_1 := -1$. The combination of implicit function and weighted Gaussian
according to equation (3.36) decreases the value of the recognition function
locally around point $X_1^v$. Figure 3.38 (middle) shows the effect along the
straight line that crosses the orbit perpendicularly: a positive and a
negative Gaussian are added, which yields constant 0. Figure 3.38 (right)
shows the values of the recognition function along the ellipse, which are
constant 1 except for the Gaussian decrease to 0 at point $X_1^v$.

Fig. 3.38. (Left) Ellipse through three seed vectors and perpendicular straight line
through a counter vector located on the ellipse; (Middle) Along the straight line the 
positive Gaussian course of proximity values is added with the negative Gaussian 
originating from the counter vector, resulting in 0; (Right) Along the ellipse the 
recognition values locally decrease at the position of the counter vector. 

The second example considers a further counter pattern, i.e.
$X_2^v \in \mathcal{X}^{vn}$, which is too near to the ellipse orbit but not
located on it. Figure 3.39 shows results similar to the previous ones;
however, the values of the recognition function along the ellipse are less
affected by the local Gaussian.

Fig. 3.39. (Left) Ellipse through three seed vectors and perpendicular straight line
through a counter vector located near to the ellipse; (Middle) Along the straight line 
the positive Gaussian course of proximity values is added with the shifted negative 
Gaussian originating from the counter vector, such that the result varies slightly 
around 0; (Right) Along the ellipse the recognition values locally decrease at the 
position near the counter vector, but less compared to Figure 3.38. 

The third example considers an additional pattern from the target object,
i.e. $X_3^v \in \mathcal{X}^{vp}$, which is far off the ellipse orbit.
Applying the transformed implicit function at $X_3^v$ yields $\eta_3$. A
Gaussian is defined with vector $X_3^v$ taken as center vector, and the
weighting factor is defined by $(1 - \eta_3)$. The recognition function is
constant 1 along the ellipse, and additionally the function values around
$X_3^v$ are increased according to a Gaussian shape (see Figure 3.40).




Fig. 3.40. (Left) Ellipse through three seed vectors and perpendicular straight 
line through a further target vector located far off the ellipse; (Middle) Along the 
straight line the positive Gaussian course of proximity values is added with the 
shifted positive Gaussian originating from the target vector, such that the result 
describes two shifted Gaussians; (Right) Along the ellipse the recognition values 
are constant 1. 
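
The three examples can be reproduced numerically in a simplified setting.
The sketch below is an illustration under stated assumptions, not the
author's experiment: the ellipse is replaced by the unit circle, the coarse
proximity is taken as a Gaussian of the radial distance to the circle, and
the coarse function and the added Gaussians share one hypothetical extent
`TAU`. The weight of each added Gaussian is the desired outcome minus the
coarse value at its center.

```python
import numpy as np

TAU = 0.05  # assumed common extent of all Gaussians


def coarse(p):
    """Proximity to the unit circle: 1 on the orbit, Gaussian decay off it."""
    return np.exp(-(np.linalg.norm(p, axis=-1) - 1.0) ** 2 / TAU)


def bump(p, center):
    """Spherical Gaussian around a triggering pattern."""
    return np.exp(-np.sum((p - center) ** 2, axis=-1) / TAU)


def recognition(p, center, target):
    """Coarse proximity plus one weighted Gaussian (weight = target - coarse)."""
    center = np.asarray(center, dtype=float)
    return coarse(p) + (target - coarse(center)) * bump(p, center)


# Points on the perpendicular line through (1, 0) and on the orbit itself.
d = np.linspace(-0.5, 0.5, 11)
line = np.stack([1.0 + d, np.zeros_like(d)], axis=1)
theta = np.linspace(0.0, 2.0 * np.pi, 9)
orbit = np.stack([np.cos(theta), np.sin(theta)], axis=1)

on_line = recognition(line, [1.0, 0.0], 0.0)      # example 1, along the line
on_orbit = recognition(orbit, [1.0, 0.0], 0.0)    # example 1, along the orbit
near_orbit = recognition(orbit, [1.2, 0.0], 0.0)  # example 2, along the orbit
far_target = recognition(np.array([[0.0, 3.0]]), [0.0, 3.0], 1.0)  # example 3
```

Under these assumptions the values along the line cancel to 0 for the first
example, the orbit values drop to 0 only at the counter pattern, the drop on
the orbit is much shallower for the counter pattern placed beside the orbit,
and the remote target pattern receives the value 1.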

Experiments with the Coarse-to-Fine Strategy of Learning

In the following experiments the coarse-to-fine strategy of learning a
recognition function is tested. For the coarse approximation we use the PCA
approach as an example, and the refinement step is based on spherical GBFs.
The primary purpose is to obtain a recognition function for a target object
of three-dimensional shape, which can be rotated arbitrarily and can have
different distances from the camera. Accordingly, both the gray value
structure and the size of the target pattern vary significantly. Furthermore,
we also deal with a specific problem of appearance-based recognition which
arises from inexact pattern localization in the image, i.e. shifted target
patterns. For assessing the coarse-to-fine strategy of learning, we compare
the results of recognition with those obtained by a coarse manifold
approximation and with those obtained from a 1-nearest-neighbor approach. The
role of the first experiment is to obtain an impression of the robustness of
recognition if, in the application phase, the views deviate in several
aspects and degrees from the ensembles of seed and validation views. In a
second experiment, the effect of the number of seed and validation views is
investigated, i.e. the role of the dimension of the approximated pattern
manifold with regard to robustness of recognition in the application phase.

Experiment Concerning Robustness of Recognition 

The first experiment considers three objects, i.e. connection box, block of
wood, and electrical board. For all three objects the system should learn a
recognition function as described above. For this purpose we take 16 images
of each object at equidistant turning angles of 22.5°, altogether 48 images.
Figure 3.41 shows a subset of three images from the connection box (first
row), the block of wood (second row), and the electrical board (third row).
For learning the recognition function of the connection box, its 16 images
are taken as seed ensemble, and the 32 images from the other two objects are
taken as validation ensemble. A similar split of the training views takes
place for learning the other two recognition functions.

Fig. 3.41. Three seed images from a collection of 16 seed images for each of
the three objects.

Various Testing Ensembles of Images

For applying the recognition functions we solely consider the connection box
and take different testing images from it. For the first set of 16 testing
images (denoted by RT1) the connection box is rotated by an offset angle of
8° relative to the seed views (see Figure 3.42, first row, and compare with
first row in Figure 3.41). For the second set of 16 testing images (denoted
by RT2) the connection box is rotated by an offset angle of 14° relative to
the seed views (see Figure 3.42, second row). For the third set of 16 testing
images (denoted by SC1) the same rotation angles are used as for the seed
views, but the distance to the camera has been decreased, which results in a
pattern scaling factor of about 1.25 (see Figure 3.42, third row, and compare
with first row in Figure 3.41). For the fourth set of 16 testing images
(denoted by SC2) the same rotation angles are used as for the seed views, but
the distance to the camera has been increased, which results in a pattern
scaling factor of about 0.75 (see Figure 3.42, fourth row). For the fifth set
of 16 testing images (denoted by SH1) the same rotation angles and camera
distance are used as for the seed views, but the appearance pattern is
shifted by 10 image pixels in vertical direction (see Figure 3.42, fifth row,
and compare with first row in Figure 3.41). For the sixth set of 16 testing
images (denoted by SH2) the same rotation angles and camera distance are used
as for the seed views, but the appearance pattern is shifted by 10 image
pixels in vertical and in horizontal direction (see Figure 3.42, sixth row).

Applying Three Approaches of Recognition 

To this testing ensemble of images we apply three approaches of object
recognition, denoted by CF1nn, CFell, and CFegn. The approaches have in
common that in a first step a testing pattern is projected into three
15-dimensional canonical frames (CFs). These are the eigenspaces of the
connection box, the block of wood, and the electrical board, each of which
is constructed from the 16 seed views of the respective object. The second
step of the approaches is the characteristic one. In the first approach
CF1nn, the recognition of a testing view is based on the maximum proximity to
all seed views from all three objects, and the relevant seed view determines
the relevant object (it is a 1-nearest-neighbor approach). In the second
approach CFell, the recognition of a testing view is based on the minimum
distance to the three hyper-ellipsoids constructed from the 16 seed views of
the three objects, respectively. Note that all 16 seed views of an object are
located on its hyper-ellipsoid, and the testing views may deviate from it
only to a certain degree. In the third approach CFegn, the recognition of a
testing view is based on a refinement of the coarse approximation of the
pattern manifold by considering counter views with a network of GBFs, i.e.
the coarse-to-fine approach of learning. The decision for recognition is
based on equation (3.37). All three approaches have an equal description
length, which is based on the seed vectors of all considered objects, i.e.
the number of seed vectors multiplied by the number of components.

Fig. 3.42. Three testing images from a collection of 16, for each of the six
categories.
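
The first step (projection into per-object canonical frames) and the
1-nearest-neighbor decision can be sketched as follows. This is a rough
illustration under assumptions: the canonical frame is taken to be the
mean-centered PCA eigenspace computed by a singular value decomposition,
which is consistent with 16 seed views spanning 15 dimensions, and distances
are measured inside each object's own frame. The function names and the
random 64-component test data are hypothetical.

```python
import numpy as np


def canonical_frame(seed_views):
    """Mean and orthonormal basis of the eigenspace spanned by the seed views."""
    seeds = np.asarray(seed_views, dtype=float)
    mean = seeds.mean(axis=0)
    # Rows of vt are the principal axes; n centered views span at most n - 1.
    _, _, vt = np.linalg.svd(seeds - mean, full_matrices=False)
    return mean, vt[: len(seeds) - 1]


def project(pattern, frame):
    """Coordinates of a pattern inside a canonical frame."""
    mean, basis = frame
    return basis @ (np.asarray(pattern, dtype=float) - mean)


def classify_1nn(pattern, seed_sets):
    """Index of the object owning the seed view closest to the pattern,
    with distances measured inside each object's canonical frame."""
    best = (np.inf, -1)
    for k, seeds in enumerate(seed_sets):
        frame = canonical_frame(seeds)
        q = project(pattern, frame)
        dists = [np.linalg.norm(q - project(s, frame)) for s in seeds]
        best = min(best, (min(dists), k))
    return best[1]


# Synthetic stand-in data: 3 objects, 16 seed views each, 64 components.
rng = np.random.default_rng(0)
seed_sets = [rng.normal(size=(16, 64)) for _ in range(3)]
test_view = seed_sets[1][5] + 0.01 * rng.normal(size=64)
frame_shape = canonical_frame(seed_sets[0])[1].shape
label = classify_1nn(test_view, seed_sets)
```

With this data each frame has 15 basis vectors, and the slightly perturbed
seed view of the second object is assigned to that object (index 1).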

Recognition Errors for Various Cases

Table 3.1 contains the numbers of erroneous recognitions for all three
approaches and all six sets of testing views. As a first result, we observe
that the CFell approach yields fewer errors than the CF1nn approach.
Apparently, the three hyper-ellipsoids, which have been constructed from the
seed views of the three objects, describe some appropriate relationships
between the seed views. This aspect is completely ignored in the
1-nearest-neighbor approach. As a second result, we observe that the CFegn
approach yields an error-free recognition for the whole testing ensemble and
hence surpasses both the CF1nn and the CFell approach. This result of the
coarse-to-fine strategy of learning is encouraging, because we took only 16
seed images from the target object and the validation ensemble consisted
exclusively of counter views from other objects. The object has been
recognized even in case of significant deviations from the seed ensemble of
views.



Errors   RT1   RT2   SC1   SC2   SH1   SH2
CF1nn      0     1     4     0    15    16
CFell      0     0     0     0     9     1
CFegn      0     0     0     0     0     0

Table 3.1. Recognition errors for six categories of testing sets by applying
three approaches. Each testing set consists of 16 elements; 16 seed vectors
and 32 validation vectors have been used for learning the recognition
functions.



Experiment with Different Sizes of Seed Ensembles 

In the second experiment different sizes of seed ensembles are used for
learning and applying the three recognition approaches CF1nn, CFell, and
CFegn. Once again three objects are considered, which look more similar to
each other (compared to the objects in the first experiment), i.e. integrated
circuit, chip carrier, and bridge rectifier. Figure 3.43 shows a subset of
three images from each object. In the first case, the seed ensemble of each
object consists of 12 images which are taken at equidistant turning angles of
30°, i.e. altogether 36 seed images from all three objects. The split between
seed and validation ensemble is done for each recognition function according
to the approach in the first experiment. We obtain a canonical frame of 11
dimensions, hence the CF1nn approach works with vectors of 11 components, the
CFell approach computes distances to hyper-ellipsoids of 11 dimensions, and
in the CFegn approach the hyper-ellipsoid function is combined with at most
22 GBFs arising from counter views.




Fig. 3.43. Three seed images from three objects, respectively. 



Various Testing Ensembles of Images 

For applying the recognition functions, we solely consider the integrated
circuit and take different testing images from it. For the first set of 12
testing images (denoted by RT) the integrated circuit is rotated by an offset
angle of 15° relative to the seed views (see Figure 3.44, first row, and
compare with first row in Figure 3.43). For the second set of 12 testing
images (denoted by SC) the same rotation angles are used as for the seed
views, but the distance to the camera has been decreased, which results in a
pattern scaling factor of about 1.25 (see Figure 3.44, second row, and
compare with first row in Figure 3.41). For the third set of 12 testing
images (denoted by SH) the same rotation angles and camera distance are used
as for the seed views, but the appearance pattern is shifted by 10 image
pixels in horizontal direction (see Figure 3.44, third row, and compare with
first row in Figure 3.43).




Fig. 3.44. Three testing images for three categories, respectively. 



Recognition Errors for Various Cases 

Table 3.2 contains the numbers of erroneous recognitions for all three
approaches and all three sets of testing views. The CFegn approach does not
surpass the other two approaches: it yields the same errors as CFell and more
errors than CF1nn on the RT set. We attribute this unexpected result to the
low number of testing views, i.e. the result is not statistically
representative.



Errors   RT   SC   SH
CF1nn     1    2    6
CFell     3    0    4
CFegn     3    0    4

Table 3.2. Recognition errors for three categories of testing sets. Each
testing set consists of 12 elements; 12 seed vectors and 24 validation
vectors have been used for learning the recognition functions.

In the second case, the seed ensemble of each object consists of 20 images
which are taken at equidistant turning angles of 18°, i.e. altogether 60
seed images from all three objects. The testing sets RT, SC, and SH are
taken according to a strategy similar to the first case, but each set
consists of 20 images. Table 3.3 shows that the CFegn approach performs at
least as well as the CF1nn approach on all three testing sets: both are
error-free on RT, and CFegn yields fewer errors on SC and SH.



Errors   RT   SC   SH
CF1nn     0    5   11
CFell     7    2    7
CFegn     0    0    8

Table 3.3. Recognition errors for three categories of testing sets. Each
testing set consists of 20 elements; 20 seed vectors and 40 validation
vectors have been used for learning the recognition functions.



In the third case, the seed ensemble of each object consists of 30 images
which are taken at equidistant turning angles of 12°, i.e. altogether 90
seed images from all three objects. Once again the CFegn approach performs at
least as well as the CF1nn approach on all three testing sets, as shown in
Table 3.4.



Errors   RT   SC   SH
CF1nn     0    8   15
CFell     1    1   16
CFegn     0    0   12

Table 3.4. Recognition errors for three categories of testing sets. Each
testing set consists of 30 elements; 30 seed vectors and 60 validation
vectors have been used for learning the recognition functions.



Finally, in the last case, we use just one testing set and compare the
recognition errors directly for different dimensions of the recognition
function. In the previous cases of this experiment the relevant testing set
was denoted by SC and consisted of images with rotation angles equal to
those of the seed images, but with a changed distance to the camera. This
changed distance to the camera is used once again, but the whole testing set
now consists of 180 images of the integrated circuit under rotation with an
equidistant turning angle of 2°. Therefore, both a variation of viewing angle
and of viewing distance is considered relative to the seed views.

Table 3.5 shows the result of applying recognition functions which have been
constructed from 6 seed views (denoted by NS1), from 12 seed views (denoted
by NS2), from 20 seed views (denoted by NS3), or from 30 seed views (denoted
by NS4). The result is impressive, because the CFell approach clearly
surpasses CF1nn, and our favorite approach CFegn is clearly better than the
other two. The course of recognition errors of CFegn with increasing
dimension shows the classical conflict between over-generalization and
over-fitting. That is, the number of errors decreases significantly when
increasing the dimension from NS1 to NS2, but remains constant or even gets
worse when increasing the dimension further from NS2 to NS3 or to NS4.
Therefore, it is convenient to take the dimension NS2 for the recognition
function as a compromise which is both reliable and efficient. Qualitatively,
all our experiments showed similar results.



Errors   NS1   NS2   NS3   NS4
CF1nn     86    59    50    49
CFell     32     3    14    18
CFegn     24     1     2     3

Table 3.5. Recognition errors for one testing set, which now consists of 180
elements. The approaches of object recognition have been trained
alternatively with 6, 12, 20, or 30 seed vectors; for the CFegn approach we
additionally take into account 12, 24, 40, or 60 validation vectors.
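
The choice of dimension can be read off the table mechanically. The short
sketch below only re-tabulates the numbers of Table 3.5; the helper names
are hypothetical, and picking the smallest seed count that attains the
minimum error is one plausible reading of the compromise described in the
text.

```python
seed_counts = [6, 12, 20, 30]  # NS1 .. NS4
errors = {
    "CF1nn": [86, 59, 50, 49],
    "CFell": [32, 3, 14, 18],
    "CFegn": [24, 1, 2, 3],
}
n_test = 180  # size of the testing set


def error_rates(counts):
    """Fraction of misrecognized testing views per seed count."""
    return [c / n_test for c in counts]


def best_seed_count(counts):
    """Smallest seed count that attains the minimum number of errors."""
    return seed_counts[counts.index(min(counts))]


rates = {name: error_rates(c) for name, c in errors.items()}
best = {name: best_seed_count(c) for name, c in errors.items()}
```

For CFell and CFegn the minimum is reached with 12 seed views (NS2), whereas
the errors of CF1nn keep decreasing up to 30 seed views.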



To sum up all experiments on object recognition, we conclude that the
dimension of the appearance manifold can be kept surprisingly low.



3.5 Summary and Discussion of the Chapter 

The coarse-to-fine approach of learning can be extended and improved in 
several aspects. 

First, in the phase of visual demonstration we can spend more effort in
selecting appropriate seed views. For example, we can try to determine a
probability distribution of views and select a subset of the most probable
ones. In line with this, we can also apply a selection strategy which is
based on maximizing the entropy. Furthermore, a support vector approach can
be used for selecting the most critical ones from the seed and validation
views. The common purpose of these approaches is to keep the seed and
validation ensembles as small, and hence the number of dimensions as low, as
possible. These approaches belong to the paradigm of active learning, in
which random or systematic sampling of the input domain is replaced by a
selective sampling [38].

Second, for complicated objects or situations a large set of seed views
and/or validation views may be necessary for a robust recognition. This set
can be split up into several subsets which are used for constructing
canonical frames individually. Therefore, the recognition function of an
object can be based on a mixture of low-dimensional canonical frames
approximating a complicated pattern manifold [165]. Only minor changes need
to be made in our approach presented above.

Third, the aspect of temporal continuity in learning and applying
recognition functions has been ignored in Subsection 3.4.4, although its
important role has been worked out clearly in Subsection 3.4.3. In line with
this, it may be useful to consider hyper-ellipsoidal basis functions both for
the coarse approximation from the seed views and the fine approximation from
validation views.

Fourth, it has already been worked out in Subsections 3.4.1 and 3.4.2 that
certain image transformations, such as the log-polar transformation, can
reduce the complexity of pattern manifolds, and hence reduce the complexity
of the recognition function. However, for a reasonable application we must
keep a certain relationship between scene object and camera. In Chapter 4
this aspect is treated in the context of robotic object grasping using an
eye-on-hand system.

Quite recently, novel contributions appeared in the literature which are
related to or can further our concept of object recognition. The exploitation
of space-time correlations in Subsection 3.4.3 is an exemplary application of
the concept of tangent distance, as introduced by Simard et al. [157]. The
work of Arnold et al. [8] combines Binford’s concept of quasi-invariants, the
Lie group analysis, and principal component analysis for constructing
subspace invariants for object recognition. In Subsection 3.4.4, we presented
an approach for learning a recognition function $f^{rc}$ which should be a
quasi-invariant for the potential appearances of the target object and should
discriminate appearances of other objects. Finally, Hall et al. [71]
presented an approach for merging or splitting eigenspaces which may
contribute to the issue of appropriate dimensionality, as was raised in
Subsection 3.4.4.




4. Learning-Based Achievement of RV 
Competences 



For designing and developing autonomous camera-equipped robots this chap- 
ter presents a generic approach which is based on systematic experiments and 
learning mechanisms. The final architecture consists of instructional, behav- 
ioral, and monitoring modules, which work on and/or modify vector fields in 
state spaces. 



4.1 Introduction to the Chapter 

The introductory section of this chapter embeds our design methodology into 
the current discussion of how to build behavior-based systems, then presents 
a detailed review of relevant literature, and finally gives an outline of the 
following sections. 



4.1.1 General Context of the Chapter 

Since the early nineties many people have been inspired by Brooks’ school of
behaviorally organized robot systems and have followed the underlying ideas
in designing autonomous robot systems. Based on diverse experience with
implementations, an ongoing discussion started in the late nineties in which
advantages and drawbacks of the behavioral robotics philosophy have been
weighed up.¹

Advantages of Brooks’ School of Behavior-Based Robotics 

The behavioral robotics philosophy originates from the observation of living
biological systems whose intelligence can be regarded as a layered
organization of competences with increasing complexity [25]. Primitive
creatures (such as ants) survive with a few low-level competences in which
reaction is the dominating characteristic, and sophisticated creatures (such
as humans) additionally possess high-level, task-solving competences in which
deliberation is the dominating characteristic. The research community for
autonomous robots succeeded in synthesizing certain low-level competences
artificially. The systems work flexibly under realistic assumptions, because
their experience has been grounded on real environments. Following a
minimalistic view, just the information which is relevant for becoming
competent must be gathered from the environment. All information acquired
from the environment or obtained from internal processes is spread throughout
various modules of different control systems on different levels. Both the
layered control and the decentralized representation make the system
fault-tolerant and robust in the sense that some vital competences continue
to work whereas others can fail. Simple competences, such as obstacle
avoidance in a priori unknown environments, can be learned without any user
interaction, i.e. relying only on sensor information from which to determine
rewards or punishments for the learning procedure. In summary, artificial
robot systems can acquire and adapt simple behaviors autonomously.

¹ For example, online discussion in newsgroup comp.ai.robotics.research in
March 1999.

J. Pauli: Learning-Based Robot Vision, LNCS 2048, pp. 171-253, 2001.
© Springer-Verlag Berlin Heidelberg 2001

Drawbacks of Brooks’ School of Behavior-Based Robotics 

In a strict interpretation of Brooks’ school, the behavioral robotics
philosophy can hardly deal with high-level, deliberate tasks [167].²
Deliberations are represented as maps, plans, or strategies, which are the
basis for collecting necessary information, taking efficient decisions, or
anticipating events. Usually, for solving a sophisticated task the necessary
information is distributed throughout the whole scene. The information must
be collected by moving sensors or cameras and perceiving the scene at
different times and/or from different perspectives. The other aspect of
dealing with deliberate tasks is that we are interested in a goal-directed,
purposeful behavior in order to solve an exemplary task within the allocated
slice of time. Due to the complexity of the particular target goal, it must
be decomposed into sub-goals, and these must be distributed throughout the
various modules which achieve certain sub-tasks and thus contribute to the
target goal. Brooks rejects such a divide-and-conquer strategy and argues
that it is difficult if not impossible to explicitly formulate goals or
sub-goals. Instead, a purposeful overall behavior is supposed to be reached
by combining individual behaviors, each one working with a generate-and-test
strategy which is based on elementary rewards or punishments. However, in our
opinion it is extremely difficult to predict the behavior if starting with
trial and error, and usually such a strategy proves inadequate to solve an
exemplary task in the requested time. In summary, both the lack of a central
representation and the lack of purpose injection make it hard to design an
autonomous robot system which is supposed to solve a high-level, deliberate
task. The dogmatic adherence to the strict behavioral robotics philosophy
does not seem helpful in designing autonomous robots for high-level tasks.



² We use the attribute deliberate for characterizing sophisticated tasks which
rely on deliberations and cannot be solved in a reactive mode exclusively [34].

Relevant Contributions for Designing Autonomous Robots

We will comment on four relevant contributions in the literature in order 
to prepare a proposal of a behavior-based design of autonomous robot sys- 
tems, which saves the advantages and overcomes the drawbacks of the strict 
behavioral robotics philosophy. 

Arkin proposes a hybrid deliberative/reactive architecture in which a
deliberative component supervises and, if necessary, modifies the reactive
component continually [7]. The deliberative component is based on a central
representation of information which is subdivided into a short-term and a
long-term memory. The short-term memory is used for collecting sensor
information that continually comes into and goes out of the sensory range,
and both current and past information are useful for guiding the robot. The
long-term memory may contain maps of a stable environment or physical
parameters of sensors or robot components. Symbolic planning methods are
proposed for implementing the deliberative component, and reinforcement
learning is suggested for acquiring reactive behaviors. However, the
applicability to hard problems has not been demonstrated. Presumably, the
traditional AI methods must be redesigned for real-world problems, in which
automatic adaptation and learning play a more fundamental role.

Colombetti et al. propose a methodology for designing behavior-based systems
which tries to overcome the ad-hocery sometimes criticized in Brooks’ system
design [39]. The guideline consists of the following four steps. First, the
requirements of the target behavior should be provided. Second, the target
behavior must be decomposed into a system of structured behaviors together
with a learning or controlling procedure for each behavior. Third, the
structured behaviors should be trained in a simulated or the real
environment. Fourth, a behavior assessment takes place based on the
pre-specified requirements of the target behavior. In our opinion this design
methodology suffers from a vital flaw, namely the suggestion to decompose the
target behavior into a system of structured behaviors in a top-down manner.
That is, an idealized description of the environment and a description of the
robot shell are the basis for the behavioral organization. Furthermore, the
control or learning procedures to be provided for the individual modules must
also be configured and parameterized from abstract descriptions. In our
opinion, this design approach will not work for hard applications, because
abstract, idealized descriptions of the real environment are not adequate for
a useful design process.

The knowledge-based, top-down design of traditional Artificial Intelligence
systems is not acceptable, because the underlying symbolic reasoning
mechanisms lack robustness and adaptability in practical applications [162].
Instead, experiments conducted in the real environment must be the design
driver, and the design process can only succeed by assessment feedback and
modification of design hypotheses in a cyclic fashion. Our favorite
methodology is called behavioral, bottom-up design, to be carried out during
an experimentation phase in which the nascent robot learns competences by
experience with the real environment. This is the only possibility to become
aware of the problems with perception and to realize that perception must be
treated in tight interconnection with actions. In Subsection 1.3.1 we
distinguished two alternative purposes for actuators in camera-equipped robot
systems, i.e. dealing with vision-for-robot tasks and robot-for-vision tasks.
Related to these two cases, the following ground truths illustrate the
interconnection between perception and action.



• Visual classes of appearance patterns are relevant and useful only
if they can be verified through action, e.g. a successful robotic
grasp is the criterion for verifying hypothetical classes of grasping
situations.

• For visual surveillance of robotic manipulation tasks the camera
must be arranged on the basis of perceptual verification such that
appropriate observation is possible, i.e. successful and unsuccessful
grasps must be distinguishable.



We realize that perception itself is a matter of behavior-based designing
due to its inherent difficulty. In the designing process one must consider the
computational complexity of perception. This agrees with the following
statement of Tsotsos [166].



“Any behaviorist approach to vision or robotics must deal with the
inherent computational complexity of the perception problem, otherwise
the claim that those approaches scale up to human-like behavior
is easily refuted.”



The computational complexity of perception determines the efficiency and
effectiveness of the nascent robot system, and these are two major criteria
for acceptance besides the aspect of autonomy. Usually, any visual or robotic
task leaves some degrees of freedom for doing the perception, e.g. choosing
among different viewing conditions and/or among different recognition
functions. It is a design issue to discover, for the application at hand,
relevant strategies for making manifold construction tractable, as discussed
in principle in Subsection 3.4.1. In line with that, the system design should
reveal behaviors which control the position and orientation of a camera such
that it takes on a normalized relation to scene objects. Under these
assumptions, the pattern manifold for tasks of object recognition or tracking
can be simplified, e.g. by log-polar transformation as introduced in
Subsection 3.4.2.

4.1.2 Learning Behavior-Based Systems 

In Subsections 1.2.2 and 1.3.2 we characterized Robot Vision and Autonomous
Robot Systems and interconnected both in a definition of Autonomous
Camera-equipped Robot Systems. This definition naturally combines
characteristics of Robot Vision with the non-strict behavioral robotics
philosophy; however, it does not include any scheme for designing such
systems. Based on the discussion in the previous section we can draw the
following conclusion concerning the designing aspect.



The behavior-based architecture of an autonomous camera-equipped 
robot system must be designed and developed on the basis of exper- 
imentation and learning. 



Making Task-Relevant Experience in the Actual Environment 

In an experimental designing phase the nascent system should make task- 
relevant experience in the actual environment (see Figure 4.1). An assessment 
is needed of how certain image operators, control mechanisms, or learning 
procedures behave. If certain constituents do not behave adequately, then a 
redesign must take place which will be directed and/or supported by a human 
designer.^ Simple competences such as obstacle avoidance can be synthesized
with minor human interaction mainly consisting of elementary rewards or 
punishments. Competences for solving high-level, deliberate tasks such as 
sorting objects can only be learned with intensified human interaction, e.g. 
demonstrating appearances of target objects. 

The configuration of competences acquired in the experimental designing
phase will be used in the application phase. During the task-solving process
a human supervisor can observe system behaviors in the real environment.
Each contributing behavior is grounded on image operations, actuator
movements, feedback mechanisms, assessment evaluations, and learning
facilities. In the exceptional case of undesired system behavior, an automatic
or manual interruption must stop the system, and a dedicated redesigning phase
will start anew. The critical difference between the two phases is that
during the experimental designing phase a human designer is integrated
intensively, whereas during the application phase just a human supervisor is
needed for interrupting the autonomous system behaviors in exceptional cases.

Task-Relevant Behaviors as a Result of Environmental Experience 

The outcome of the experimental designing phase is a configuration of
task-relevant behaviors (see again Figure 4.1). The task-relevance is grounded
on making purposive experiences in the actual environment in order to become
acquainted with the aspects of situatedness and corporeality (see Subsection
1.3.2). We have already worked out in Chapters 2 and 3 that visual
demonstration and learning processes play the key role in acquiring operators
for object localization and recognition. The acquisition and/or fine-tuning
of image operators is typically done in the experimentation phase. Generally,
image operators are a special type of behavior which extracts and represents
task-relevant information in an adaptive or non-adaptive manner. Those image
processing behaviors are constituents of robotic behaviors which are
responsible for solving robotic tasks autonomously. Other constituents of
robotic behaviors may consist of strategies for scene exploration, control
procedures for reaching goal situations, learning procedures for acquiring
robot trajectories, etc. It is the purpose of the designing phase to decompose
the target behavior into a combination of executable behaviors, determine
whether a behavior should be adaptive or non-adaptive, and specify all
relevant constituents of the behaviors. For the behavioral organization we
have to consider requirements of robustness, flexibility, and time limitation
simultaneously.

^ In our opinion, the automatic, self-organized designing of autonomous robot
systems for solving high-level, deliberate tasks is beyond feasibility.




Image-Based Robot Servoing 

The backbone of an autonomous camera-equipped robot system consists of
mechanisms for image-based robot servoing. These are continual processes of
perception-action cycles in which actuators are gradually moved and
continually controlled with visual sensory feedback. Primary applications are
tool handling for manipulating objects and camera movement for detailed,
incremental object inspection. For example, for the purpose of object
inspection the camera head should be controlled to reach a desired size,
resolution, and orientation of the depicted object. A secondary application is
the self-characterization of cameras, i.e. determining the optical axes, the
fields of view, and the location of the camera head. In the experimental
designing phase one has to configure and parameterize servoing mechanisms,
e.g. redesign unsuccessful controllers and identify successful controllers
which reach their sub-goals.

We are convinced that the potential usefulness of image-based robot servoing
is far from being sufficiently realized, for several reasons. First, visual
goal situations of robot servoing should be grounded in the real environment,
i.e. should be specified by visual demonstration in the experimental designing
phase. For example, in Subsection 3.3.4 we demonstrated examples of scored
grasping situations, which can be used as a guideline for servoing the robot
hand towards the most stable grasping position. Second, the various
degrees-of-freedom (DOF) of a robot head, e.g. pan, tilt, vergence, focus,
focal length, and aperture of the head-cameras, must be controlled in
cooperation in order to exploit their complementary strengths. Unfortunately,
our work will not contribute to this problem. Third, nearly all contributions
to robotic visual servoing describe systems consisting of just one robot, e.g.
exclusively a robot manipulator or a robot head. Instead, we present
applications of image-based robot servoing for a multi-component robot system
consisting, by way of example, of a mobile robot head, a stationary
manipulator, and a rotary table.

The Issue of a Memory in Behavior-Based Systems 

A memory will be put at the disposal of the behavior-based camera-equipped
robot system. This is indispensable if high-level, deliberate tasks must be
solved autonomously and requirements like robustness, flexibility, and time
limitation should hold simultaneously. It is convenient to decentralize the
memory according to certain competences of the task-solving process, i.e. to
introduce a shared, competence-centered memory. The memory of the system
preserves the opportunity for incremental development, because the need for
frequent sensing in stable environments is reduced and information from
outside the sensory range can be used for guiding the robot. The discovery of
movement strategies for actuators will be accelerated by memorized bias
information which can be acquired in previous steps of the task-solving
process. For a high-level task-solving strategy several behaviors contribute
information at different phases. It is this characteristic of contributory
behaviors, along with the diverse flow of information, which lets us prefer a
shared memory technology (as opposed to a message passing technology) and thus
simplify the designing of the behavior-based robot system.
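
The idea of a shared memory that is partitioned by competence can be
illustrated with a small sketch. It is only an illustration under stated
assumptions: the class name SharedMemory, the region name
"obstacle_localization", and the stored coordinates are hypothetical and do
not describe the system presented in this work.

```python
from collections import defaultdict


class SharedMemory:
    """Blackboard-style store, partitioned into regions by competence.

    Hypothetical illustration; the region and key names are assumptions.
    """

    def __init__(self):
        self._regions = defaultdict(dict)

    def write(self, competence, key, value):
        """Store a value in the region that belongs to one competence."""
        self._regions[competence][key] = value

    def read(self, competence, key, default=None):
        """Look up a value; return the default if nothing was memorized."""
        return self._regions.get(competence, {}).get(key, default)


memory = SharedMemory()
# An exploration behavior memorizes an obstacle position it has localized ...
memory.write("obstacle_localization", "obstacle_1", (0.42, 0.10, 0.05))
# ... and a later navigation behavior reuses it without sensing again.
position = memory.read("obstacle_localization", "obstacle_1")
```

Because every behavior addresses the same store, a behavior that runs late in
the task-solving process needs no explicit message from the behavior that
produced the information.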

Acquiring Knowledge by Interaction with the Environment

As opposed to traditional knowledge-based systems, all memory contents of
a behavior-based system must be acquired and verified by system interaction
with the environment. In the experimental designing phase this interaction
is directed by a human system designer, and in the application phase the
interaction should be autonomous. During the experimental designing phase
the basic memory contents will be acquired, e.g. feature compatibilities for
object localization, pattern manifolds for object recognition, parameters for
image pre-processing and feature extraction, camera-related characteristics
like geometric relation to the robot, optical axis, and field of view,
parameters of visual feedback mechanisms, maps or strategies of camera
movement for scene inspection or of arm movement for object handling, etc. In
the application phase the "hot" memory contents for solving the task will be
acquired, e.g. positions of obstacles to be localized by moving the camera
according to a fixed or exploratory strategy, obstacle-avoiding policy of
manipulator movement for approaching a target position, geometric shape of a
target object to be determined incrementally by observation from different
viewpoints, etc. During this application phase various system constituents
will read from and write to certain regions of the memory, i.e. have a share
in the memory. We summarize this discussion as follows.



A camera-equipped robot system can reach autonomy in solving cer- 
tain high-level, deliberate tasks if three preconditions hold. 

• A behavioral, experimental designing phase must take place in the 
actual environment under the supervision of a system designer. 

• In the application phase various mechanisms of image-based robot 
servoing must be available for reaching and/or keeping sub-goals 
of the task-solving process. 

• Various exploration mechanisms should incrementally contribute 
information for the task-solving process and continually make use 
of the memorized information. 



The following section reviews relevant contributions in the literature. 

4.1.3 Detailed Review of Relevant Literature 

The behavioral, experimental designing phase is composed of demonstration,
perception, learning, and assessment. The important role of this designing
methodology for obtaining operators for object localization and recognition
has already been treated extensively in Section 1.4 and in Chapters 2 and 3.
In this chapter the focus is more on acquiring strategies for decomposing a
target behavior into a configuration of simpler behaviors, on the interplay
between camera movement and object recognition or scene exploration, and
on the learning of purposive perception-action cycles. The tight coupling
between perception and action has also been emphasized by Sommer [162].
A more detailed treatment is presented in a work of Erdmann [51], who
suggests designing so-called action-based sensors, i.e. sensors should not
recognize states but recognize applicable actions. A tutorial overview of
action selection mechanisms is presented by Pirjanian [129, pp. 17-56],
including priority-based arbitration (e.g. Brooks’ subsumption architecture),
state-based arbitration (e.g. reinforcement learning approach),
winner-take-all arbitration (e.g. activation networks), voting-based command
fusion (e.g. action voting), fuzzy command fusion, and superposition-based
command fusion (e.g. dynamical systems approach).
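
Two of the listed categories, priority-based arbitration and voting-based
command fusion, can be contrasted in a few lines. The following is a minimal
sketch and not the formulation of any cited work: the function names, the
command labels, and the vote weights are assumptions chosen for illustration,
and the arbitration shown is far simpler than a full subsumption architecture.

```python
from collections import Counter


def priority_arbitration(proposals):
    """Return the command of the highest-priority behavior that is triggered.

    proposals: list of (priority, command) pairs; command is None when the
    behavior is not triggered. Larger numbers mean higher priority.
    """
    active = [(priority, cmd) for priority, cmd in proposals if cmd is not None]
    if not active:
        return None
    return max(active, key=lambda pair: pair[0])[1]


def voting_fusion(votes_per_behavior):
    """Add up the votes of all behaviors and return the best-supported command."""
    tally = Counter()
    for votes in votes_per_behavior:
        tally.update(votes)  # Counter.update adds the given weights
    return tally.most_common(1)[0][0] if tally else None


# Arbitration: exactly one behavior wins control.
winner = priority_arbitration([(0, "wander"), (2, None), (1, "avoid")])

# Fusion: every behavior contributes to the decision.
fused = voting_fusion([
    {"left": 0.6, "straight": 0.4},
    {"left": 0.3, "right": 0.7},
])
```

In the arbitration case the command of a single behavior is passed on
unchanged, whereas in the fusion case the selected command may be one that no
behavior ranked first on its own.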

A special journal issue on Learning Autonomous Robots has been published
by Dorigo [49, pp. 361-505]. Typical learning paradigms are supervised
learning, unsupervised learning, reinforcement learning, and evolutionary
learning. With regard to applying video cameras, only two articles are
included, both of which deal with automatic navigation of mobile robots. We
recommend these articles to the reader because bottom-up designing principles
are applied, e.g. learning to select useful landmarks [67], and learning to
keep or change steering directions [13]. Another special journal issue on
Robot Learning has been published by Sharkey [155, pp. 179-406]. Numerous
articles deal with reinforcement learning applied in simulated or real
robots. Related to the aspect of bottom-up designing, the work of Murphy et
al. [113] is interesting, because it deals with learning by experimentation,
especially for determining navigational landmarks. A special journal issue on
applications of reinforcement learning has also been published by Kaelbling
[88], and a tutorial introduction is presented by Barto et al. [14].

The paradigm Programming by Demonstration (PbD) is relevant for this
chapter. Generally, a system must record user actions and, based on them,
generalize a program that can be used for new examples/problems. Cypher edited
a book on this topic which contains many articles reporting on methodologies,
systems, and applications [45]. However, no articles on automated generation
of vision algorithms or robot programs are included. In contrast, the work of
Friedrich et al. applies the PbD paradigm to generating robot programs by
generalizing from several example sequences of robot actions [58]. The
approaches in [84] and [93] differ from the previous one especially in that a
human operator demonstrates objects, situations, and actuator movements, and
the system should make observations in order to recognize the operator
intention and imitate the task solution. All three systems focus on the
symbolic level and treat image processing or visual feedback mechanisms only
to a small extent.

We continue with reviewing literature on image-based robot servoing.
Quite recently, a special issue of the International Journal of Computer
Vision has been devoted to Image-based Robot Servoing [79]. A book edited by
Hashimoto gives an overview of various approaches to automatic control of
mechanical systems using visual sensory feedback [76]. Let us just mention
the introductory, tutorial work of Corke [41].^ There, two approaches to
visual servoing are proposed, i.e. the position-based and the feature-based
approach. In position-based control, features are extracted from the image and
used in conjunction with a geometric model of the target to determine the pose
of the target with respect to the camera. In feature-based (image-based)
servoing the last step is omitted, and servoing is done on the basis of image
features directly. In our applications we use both approaches depending on
specific sub-tasks.

Hager et al. describe a system that positions a robot manipulator using
visual information from two stationary cameras [70]. The end-effector and the
visual features defining the goal position are simultaneously tracked using a
proportional-integral (PI) controller. We adopt the idea of using Jacobians
for describing the 3D-2D relation, but take projection matrices of a poorly
calibrated head-camera-manipulator relation into account instead of explicit
camera parameters.

The system of Feddema et al. tracks a moving object with a single camera
fastened on a manipulator [54]. A visual feedback controller is used which is
based on an inverse Jacobian matrix for transforming changes from image
coordinates to robot joint angles. The work is interesting to us because the
role of a teach-by-showing method is mentioned. Offline, the user teaches the
robot desired motion commands and generates reference vision-feature data.
In the online playback mode the system executes the motion commands and
controls the robot until the extracted feature data correspond to the
reference data.
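
The principle of such a feedback controller can be sketched for a toy case.
The sketch below is not the implementation of the cited system; it assumes
two image features, two joints, a constant and exactly known 2x2 Jacobian,
and a linear camera model, and all names and numbers are hypothetical.

```python
def solve_2x2(jacobian, rhs):
    """Solve J * x = rhs for a 2x2 matrix J via its explicit inverse."""
    (a, b), (c, d) = jacobian
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Jacobian is singular; no unique joint update exists")
    return [(d * rhs[0] - b * rhs[1]) / det,
            (-c * rhs[0] + a * rhs[1]) / det]


def servo_step(joints, features, reference, jacobian, gain=0.5):
    """One feedback cycle: map the image-feature error to a joint increment."""
    error = [r - f for r, f in zip(reference, features)]
    delta = solve_2x2(jacobian, error)
    return [q + gain * dq for q, dq in zip(joints, delta)]


# Toy camera model (assumption): features depend linearly on the joint angles.
JACOBIAN = [[2.0, 0.5], [-0.3, 1.5]]


def observe(joints):
    """Simulated feature extraction for the current joint configuration."""
    return [sum(j * q for j, q in zip(row, joints)) for row in JACOBIAN]


joints = [0.0, 0.0]
reference = observe([0.4, -0.2])  # reference features from the teaching phase
for _ in range(20):               # playback: repeat until the features match
    joints = servo_step(joints, observe(joints), reference, JACOBIAN)
```

With an exact Jacobian and a gain of 0.5 the feature error is halved in every
cycle. In a real system the Jacobian is only an approximation, so a gain below
1 is a common precaution against overshooting.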

Papanikolopoulos and Khosla present an algorithm for robotic camera 
servoing around a static target object with the purpose of reaching a certain 
relation to the object [124]. This is done by moving the camera (mounted on a 
manipulator) such that the perspective projections of certain feature points of 
the object reach some desired image positions. In our work, a similar problem
occurs in controlling a manipulator to carry an object towards the
head-camera such that a desired size, resolution, and orientation of the
depicted object are reached.

We continue with reviewing literature on mechanisms for active scene
exploration and object inspection including techniques for incremental
information collection and representation. A book edited by Landy et al. gives
the state of the art of exploratory vision and includes a chapter on robots
that explore [98]. Image-based robot servoing must play a significant role,
especially in model-free exploration of scenes. A typical exploration
technique for completely unknown environments is reinforcement learning, which
has already been mentioned above.

The two articles of Marchand and Chaumette [104] and Chaumette et
al. [35] deal with 3D structure estimation of geometric primitives like blocks
and cylinders using active vision. The method is based on controlled motion
of a single camera and involves gazing at the considered objects. The
intention is to obtain a high accuracy by focusing on the object and
generating optimal camera motions. An optimal camera movement for
reconstructing a cylinder would be a circle around it. This camera trajectory
is acquired via visual servoing around the cylinder by keeping the object
depiction in vertical orientation in the image center. A method is presented
for combining 3D estimations from several viewpoints in order to recover the
complete spatial structure. Generally, the gaze planning strategy mainly uses
a representation of known and unknown spatial areas as a basis for selecting
viewpoints.

^ A tutorial introduction to visual servo control of robotic manipulators has
also been published by Hutchinson et al. [82].

A work of Dickinson et al. presents an active object recognition strategy
which combines the use of an attention mechanism for focusing the search
for a 3D object in a 2D image with a viewpoint control strategy for
disambiguating recovered object features [47]. For example, the sensor will be
servoed to a viewing position such that different shapes can be distinguished,
e.g. blocks and cylinders. Exploration techniques also play a role in tasks of
vision-based robotic manipulation like robotic grasping. Relevant grasp
approaching directions must be determined, which can be supported by training
the system to grasp objects. Techniques of active learning will reduce the
number of examples from which to learn [144]. The emphasis of this chapter
is not on presenting sophisticated approaches of scene exploration, but on the
question of how to integrate such techniques in a behavioral architecture.

Exploration mechanisms must deal with target objects, obstacle objects,
and navigational strategies. A unified framework for representation can be
provided by so-called dynamic vector fields which are defined in dynamical
systems theory [153, 50]. Attractor vector fields are virtually placed at the
positions of target objects, and repellor vector fields are virtually placed
at the positions of obstacle objects. By summing all contributing vector
fields we obtain useful hypotheses of goal-directed, obstacle-avoiding
navigation trajectories. In a work of Mussa-Ivaldi the attractors and
repellors are defined, by way of example, by radial basis functions and
gradients thereof [115]. This chapter will make use of the mentioned
methodology for visually navigating a robot arm through a set of obstacle
objects with the task of reaching and grasping a target object.

Generally, the mentioned kind of vector fields simulates attracting and
repelling forces acting on the robot effectors and therefore can be used for
planning robot motions. The vector fields are obtained as gradients of
so-called potential functions which are centered at target and obstacle
positions. A tutorial introduction to so-called potential field methods is
presented by Latombe [99, pp. 295-355]. In this chapter, we use the term
dynamic vector field in order to emphasize the dynamics involved in the
process of solving a high-level task. The sources of dynamics are manifold.
For example, sub-goals are pursued successively, and certain situations arise
unexpectedly and must be handled. Furthermore, coarse movement plans are
refined continually based on the latest scene information acquired from
images. Related to that, gross motions are performed to roughly approach
targets on the basis of coarse reconstructions, and afterwards fine motions
are performed to finally reach haptic contact [99, pp. 452-453]. The system
dynamics will be reflected in ongoing changes of the involved vector fields.

The dynamic vector field approach considers the inherent inaccuracy of
scene reconstruction naturally. This is achieved by specifying the width
(support) of the underlying potential functions dependent on expected levels
of inaccuracy. The related lengths of the gradient vectors are directly
correlated with acceptable minimal distances from obstacles. Another advantage
is the minimal disturbance principle, i.e. a local addition or subtraction of
attractors or repellors causes just a local change of the movement strategy.
However, this characteristic of locality is also the reason for difficulties
in planning global movement strategies. Therefore, apart from potential field
methods, many other approaches to robot motion planning have been developed
[99]. For example, graph-based approaches aim at representing the global
connectivity of the robot’s free space as a graph that is subsequently
searched for a minimum-cost path. Those approaches are favourable for more
complicated planning problems, e.g. generating assembly sequences [141]. This
chapter is restricted to simple planning problems for which dynamic field
methods are sufficient.
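
For contrast with the local field methods, the graph-based alternative can be
sketched as a standard minimum-cost search. The sketch uses Dijkstra's
algorithm as one possible search method; the graph, its node names, and the
edge costs are hypothetical and do not stem from the cited works.

```python
import heapq


def min_cost_path(graph, start, goal):
    """Dijkstra search; graph maps node -> {neighbor: non-negative cost}."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, step in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return float("inf"), []  # the goal is not reachable


# Hypothetical connectivity graph of the free space.
free_space = {
    "start": {"a": 1.0, "b": 4.0},
    "a": {"b": 1.5, "goal": 5.0},
    "b": {"goal": 1.0},
}
cost, path = min_cost_path(free_space, "start", "goal")
```

The search considers the whole graph before committing to a route, which is
exactly the global view that a purely local vector field lacks.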

The deliberative task of robot motion planning must be interleaved with 
the reactive task of continual feedback-based control. In the work of Murphy 
et al. [114] a planning system precomputes an a priori set of optimal paths, 
and in the online phase terrain changes are detected which serve to switch 
the robot from the current precomputed path to another precomputed path. 
A real-time fusion of deliberative planning and reactive control is proposed 
by Kurihara et al. [95], whose system is based on a cooperation of behavior 
agents, planning agents, and behavior-selection agents. The system proposed 
by Donnart and Meyer [48] applies reinforcement learning to automatically 
acquire planning and reactive rules which are used in navigation tasks. Un- 
fortunately, most of the mentioned systems seem to work only in simulated 
scenes. Beyond the application in real scenes, the novelty of this chapter is to 
treat the deliberative and the reactive task uniformly based on the method- 
ology of dynamic vector fields. 

4.1.4 Outline of the Sections in the Chapter 

In Section 4.2 several basic mechanisms are presented, such as visual feedback
control, virtual force superposition, and integration of deliberate strategies
with visual feedback. A series of generic modules for designing instructions,
behaviors, and monitors is specified. Section 4.3 introduces an exemplary
high-level task, discusses designing-related aspects according to the
bottom-up methodology, and presents task-specific modules which are based on
the generic modules of the preceding section. In Section 4.4 modules for
acquiring the coordination between robots and cameras are presented. Depending
on the relevant controller, which is intended for solving a sub-task, we
present appropriate mechanisms for estimating the head-camera-manipulator
relation. Furthermore, image-based effector servoing will be applied for
determining the optical axis and the field of view of the head-camera. Section
4.5 discusses the approaches of the preceding sections.



4.2 Integrating Deliberate Strategies and Visual Feedback

We describe two basic mechanisms of autonomous camera-equipped robot
systems, i.e. virtual force superposition and visual feedback control.® Both
mechanisms can be treated cooperatively in the framework of so-called
deliberate and reactive vector fields. Different versions and/or combinations
of the basic mechanisms occur in so-called basic, generic modules which can be
used as a library for implementing task-specific modules. As a result of
several case studies on solving high-level, deliberate tasks (using autonomous
camera-equipped robot systems) we have discovered 12 different categories
of generic modules. They are subdivided into three categories of instructional
modules, six categories of behavioral modules, and three categories of
monitoring modules. Task-specific modules make use of the basic, generic
modules, but with specific implementations and parametrizations. In the second
part of this section we present and explain the scheme of the task-specific
modules and the basic, generic modules. The basic mechanisms and the generic
modules will be applied to an exemplary high-level task in Section 4.3.

4.2.1 Dynamical Systems and Control Mechanisms 

State Space of Proprioceptive and Exteroceptive Features

In Subsection 1.3.2 we introduced the criteria situatedness and corporeality
for characterizing an autonomous robot system, and in particular we treated
the distinction between proprioceptive and exteroceptive features.
Proprioceptive features describe the state of the effector, and exteroceptive
features describe aspects of the environmental world in relation to the
effector.

Transition Function for States 

The proprioceptive feature vector of the effector is subdivided into a fixed
state vector $S^c$ and a variable state vector $S^v(t)$. The vector $S^c$ is
inherently constant, and the vector $S^v(t)$ can be changed through a vector
of control signals $C(t)$ at time $t$. For example, the fixed state vector of
a robot manipulator contains the Denavit-Hartenberg parameters length, twist,
and offset for each link, which are constant for rotating joints [42]. The
variable state vector $S^v(t)$ could be the 6-dimensional state of the robot
hand describing its position (in 3D coordinates $X$, $Y$, $Z$) and orientation
(in Euler angles yaw, pitch, roll). On the basis of the variable state vector
$S^v(t)$ and the control vector $C(t)$ the transition function determines the
next state vector $S^v(t+1)$.

® For solving high-level, deliberate tasks the camera-equipped robot system must
be endowed with further basic mechanisms, e.g. reinforcement learning and/or
unsupervised situation clustering. However, a detailed treatment is beyond the
scope of this work.

$$S^v(t+1) := f(C(t), S^v(t)) \tag{4.1}$$

If the vectors $C(t)$ and $S^v(t)$ are of equal dimension with the components
corresponding pairwise, and the function $f$ is the vector addition, then
$C(t)$ serves as an increment vector for $S^v(t)$. For example, if the control
vector for the robot hand is defined by
$C(t) := (\Delta X, \Delta Y, \Delta Z, 0, 0, 0)^T$, then after the movement
the state vector $S^v(t+1)$ describes a new position of the robot hand while
preserving the orientation. Both the state and control vector are specified in
the manipulator coordinate system.
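
The additive special case of the transition function can be written down
directly. The following minimal sketch assumes a 6-dimensional hand state
(position X, Y, Z and Euler angles yaw, pitch, roll); the function name
transition and all numeric values are hypothetical.

```python
def transition(control, state):
    """Additive transition: the control vector acts as an increment."""
    if len(control) != len(state):
        raise ValueError("control and state must have equal dimension")
    return [s + c for s, c in zip(state, control)]


# Hypothetical hand state: position (X, Y, Z) and angles (yaw, pitch, roll).
state = [100.0, 50.0, 200.0, 0.1, 0.0, -0.2]
# Pure translation: the last three components are zero, so the orientation
# of the hand is preserved.
control = [5.0, -2.0, 10.0, 0.0, 0.0, 0.0]
next_state = transition(control, state)
```

Only the first three components of next_state differ from state, which
corresponds to a displacement of the hand without a change of orientation.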

Task-relevant control vectors are determined from a combination of pro- 
prioceptive and exteroceptive features. From a system theoretical point of 
view a camera-equipped robot system is a so-called dynamical system in 
which the geometric relation between effectors, cameras, and objects changes 
dynamically. The process of solving a high-level, deliberate task is the target 
behavior of the dynamical system, which can be regarded as a journey through 
the state space of proprioceptive and exteroceptive features. The character- 
istic of the task and the existence of the environment (including the robot) 
are the basis for relevant affinities and constraints from which to determine 
possible courses of state transitions. 

Virtual Force Superposition 

Affinities and constraints can be represented uniformly as virtual forces. The
trajectory of a robot effector is determined by superposing several virtual
forces and discovering the route which consumes minimal energy and leads to a
goal state. This goal state should have the characteristic of equilibrium
between all participating forces. For example, the task of robotic navigation
can be regarded as a mechanical process in which the target object virtually
attracts the effector and each obstacle object virtually repels the effector.
Attractor forces and repellor forces are the basic constituents determining
the overall behavior of an effector in a dynamical system.

Attractor and Repellor Vector Fields 

We define attractor vector fields and repellor vector fields for the space of
variable state vectors. Let $S^v_A$ denote a particular state vector, which is
taken as the position of an attractor. Then, the attractor vector field is
defined by

$$VF_A[S^v_A](S^v) := v_A \cdot \frac{S^v_A - S^v}{\| S^v_A - S^v \|} \tag{4.2}$$
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The force vectors at all possible state vectors S'" of the effector are oriented 
to the attractor position S\ and factor va is specifying the unique length of 
the force vectors (see left image in Figure 4.2). On the other hand, a particular 
state vector S^ is taken as the position of a repellor. The repellor vector field 
can be defined by 

VFnlS'k] (S") := 2 -VB- (S" - S]^) • exp (- || /3 • (S" - S^) f ) (4.3) 

The force vectors in the neighborhood of the repellor position are directed 
radially off this position, and the size of the neighborhood is defined by factor 
vb (see right image in Figure 4.2). 
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Fig. 4.2. (Left) Attractor vector field; (Right) Repellor vector field. 
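
The two field definitions can be sketched in a few lines of code. The sketch
assumes that equation (4.2) yields a force of constant length pointing to the
attractor, and that equation (4.3) has the form
$2 \cdot \upsilon_B \cdot (S^v - S^v_R) \cdot \exp(-\|\beta \cdot (S^v - S^v_R)\|^2)$.
The function names, the parameter names `upsilon_a`, `upsilon_b`, `beta`, and
their default values are hypothetical.

```python
import math
from typing import List, Sequence


def attractor_force(s: Sequence[float], s_a: Sequence[float],
                    upsilon_a: float = 1.0) -> List[float]:
    """Force of constant length upsilon_a pointing from s to the attractor s_a."""
    diff = [a - x for a, x in zip(s_a, s)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff)  # equilibrium at the attractor position itself
    return [upsilon_a * d / norm for d in diff]


def repellor_force(s: Sequence[float], s_r: Sequence[float],
                   upsilon_b: float = 1.0, beta: float = 1.0) -> List[float]:
    """Radial force pointing away from the repellor s_r, damped by a Gaussian."""
    diff = [x - r for x, r in zip(s, s_r)]
    squared = sum((beta * d) ** 2 for d in diff)
    scale = 2.0 * upsilon_b * math.exp(-squared)
    return [scale * d for d in diff]


pull = attractor_force([0.0, 0.0], [3.0, 4.0], upsilon_a=2.0)
push_near = repellor_force([1.0, 0.0], [0.0, 0.0])
push_far = repellor_force([10.0, 0.0], [0.0, 0.0])
```

In this sketch `pull` always has length `upsilon_a`, whereas the repelling
force is noticeable only close to the repellor and fades quickly with distance.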



Trajectories in the Superposed Vector Fields

The overall force vector field is simply obtained by summing up the collection
of attractor vector fields together with the collection of repellor vector
fields.

$$VF_O[\{S^v_{A_i}\}, \{S^v_{R_j}\}](S^v) := \sum_{S^v_{A_i}} VF_A[S^v_{A_i}](S^v) + \sum_{S^v_{R_j}} VF_R[S^v_{R_j}](S^v) \tag{4.4}$$

For example, the left image in Figure 4.3 shows the superposition of the
attractor and the repellor vector field of Figure 4.2. With these definitions
the control vector $C(t)$ can be defined subject to the current state vector
$S^v(t)$ of the effector by

$$C(t) := VF_O[\{S^v_{A_i}\}, \{S^v_{R_j}\}](S^v(t)) \tag{4.5}$$

A trajectory towards the state vector $S^v_A$ which by-passes state vector
$S^v_R$ is shown in the right image of Figure 4.3. It has been determined by
applying equation (4.1) iteratively.

Fig. 4.3. (Left) Superposition of attractor and repellor vector field; (Right)
Trajectory towards state vector $S^v_A$ which by-passes state vector $S^v_R$.
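
The iterative construction of such a trajectory can be sketched as follows.
The sketch assumes vector addition as the transition function of equation
(4.1), sums the fields as in equation (4.4), and uses the summed force as
control vector as in equation (4.5). All positions, the parameter values, and
the stopping tolerance are hypothetical; the repellor strength is chosen
smaller than the attractor strength so that the two forces cannot cancel on
the way.

```python
import math

UPSILON_A, UPSILON_B, BETA = 0.2, 0.1, 1.0  # hypothetical field parameters


def overall_force(s, attractors, repellors):
    """Sum of all attractor and repellor forces at state s."""
    total = [0.0] * len(s)
    for a in attractors:
        d = math.dist(s, a)
        if d > 0.0:
            total = [t + UPSILON_A * (ai - si) / d for t, si, ai in zip(total, s, a)]
    for r in repellors:
        w = 2.0 * UPSILON_B * math.exp(-(BETA * math.dist(s, r)) ** 2)
        total = [t + w * (si - ri) for t, si, ri in zip(total, s, r)]
    return total


def trajectory(start, goal, attractors, repellors, tol=0.2, max_steps=5000):
    """Apply state := state + control until the goal is within tol."""
    path = [list(start)]
    s = list(start)
    for _ in range(max_steps):
        if math.dist(s, goal) < tol:
            break
        control = overall_force(s, attractors, repellors)
        s = [si + ci for si, ci in zip(s, control)]
        path.append(s)
    return path


goal = [10.0, 0.0]
obstacle = [5.0, -0.5]
path = trajectory([0.0, 0.0], goal, [goal], [obstacle])
```

The resulting path bends away from the obstacle on its way to the goal instead
of following the straight line between start and goal.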



It is convenient to call an attractor vector field simply an attractor and to
call a repellor vector field simply a repellor. Furthermore, we call an
equilibrium point of a vector field simply an equilibrium. The force vector
length at an equilibrium is zero, i.e. no force is at work at these points. For
example, an equilibrium is located at the center of an attractor vector field
(see left image of Figure 4.2). On the other hand, a repellor vector field also
contains an equilibrium at the center but additionally an infinite set of
equilibria outside a certain neighborhood (see dots without arrows in the
right image of Figure 4.2).
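
How the repelling force behaves with growing distance can be checked
numerically. The sketch below assumes the repellor field of equation (4.3) in
the form $2 \cdot \upsilon_B \cdot d \cdot \exp(-(\beta d)^2)$ for the force
length at distance $d$ from the repellor. Under this assumption the force is
exactly zero only at the center; far away it becomes so small that it can be
treated as zero in practice, which is presumably what the dots without arrows
depict. The function name and the sampled distances are hypothetical.

```python
import math


def repellor_magnitude(d: float, upsilon_b: float = 1.0, beta: float = 1.0) -> float:
    """Length of the repelling force at distance d from the repellor."""
    return 2.0 * upsilon_b * d * math.exp(-(beta * d) ** 2)


# For beta = 1 the force length is largest at d = 1 / sqrt(2).
d_peak = 1.0 / math.sqrt(2.0)
for d in (0.0, d_peak, 2.0, 4.0, 6.0):
    print(f"d = {d:.3f}  |force| = {repellor_magnitude(d):.3e}")
```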

Vector Fields for Representing Planned and Actual Processes 

The force vector field can serve as a uniform scheme both for representing
a high-level task and for representing the task-solving process in which the
camera-equipped robot system is involved. As a precondition for enabling
situatedness and corporeality, the system must determine attractors and
repellors automatically from the environmental images. The critical question
is: Which supplements to the basic methodology of force vector fields are
necessary in order to facilitate the treatment of high-level, deliberate
tasks? In the following subsection we introduce visual feedback control and
explain why this supplement is necessary. After this, a distinction between
deliberate and reactive vector fields is introduced for organizing the
deliberate and reactive aspects of solving a high-level task.

Generic Mechanism of Visual Feedback Control 

The particular state vectors $S^v_A$ in equation (4.2) and $S^v_R$ in equation
(4.3), based on which to specify attractors and repellors, are represented in
the same vector space as the variable state vector $S^v(t)$ of the effector.
For example, for the task of robotic object grasping the vector $S^v(t)$
represents the robot hand position, vector $S^v_A$ represents the target
object position, and $S^v_R$ represents an obstacle object position, and all
these positions are specified in a common coordinate system which is attached
at the manipulator base. Through control vector $C(t)$ the state vector
$S^v(t)$ of the robot effector changes, and so does the geometric relation
between robot effector and target object.

For interacting with the environment the vectors $S^v_A$ and $S^v_R$ must be
determined on the basis of exteroceptive features, i.e. they are extracted
from images taken by cameras. The acquired vectors are inaccurate to an
unknown extent because of inaccuracies involved in image processing. With
regard to the variable state vector $S^v(t)$ we must distinguish two
modalities. On the one hand, this current state of the robot effector is
computed simply by taking the forward kinematics of the actuator system into
account. On the other hand, the state of the robot effector can be extracted
from images if the effector is located in the field of view of some cameras.
Of course, for comparing the two results we must represent both in a common
coordinate system. However, the two representations are not equal, which is
mainly due to friction losses and inaccuracies involved in reconstruction
from images.

In summary, all vectors $S^v_A$, $S^v_R$, and $S^v(t)$, acquired from the
environment, are inaccurate to an unknown extent. Consequently, we will regard
the force vector field just as a bias, i.e. it will play the role of a
backbone which represents planned actions. In many sub-tasks it is inevitable
to combine the force vector field with a control mechanism for fine-tuning the
actions based on continual visual feedback. Based on this discussion we
introduce the methodology of image-based effector servoing.

Definition 4.1 (Image-based effector servoing) Image-based effector 
servoing is the gradual effector movement of a robot system continually con- 
trolled with visual sensory feedback. 

Measurement Function and Control Function 

In each state of the effector the cameras take images from the scene. This is
symbolized by a measurement function $f^{ms}$ which produces a current
measurement vector $Q(t)$ at time $t$ (in coordinate systems of the cameras).

$$Q(t) := f^{ms}(S^v(t), S^c) \tag{4.6}$$

The current state vector $S^v(t)$ in equation (4.6) is supposed to be
determined by forward kinematics. The current measurement vector $Q(t)$ may
also contain features which are based on the fixed state vector $S^c$ of the
effector but are extracted from image contents, e.g. image features which
describe the appearance of the gripper fingers.

According to this, the two modalities of representing the current state
of the effector (as discussed above) are treated in the equation. Given the
current measurement vector $Q(t)$, the current state vector $S^v(t)$, and a
desired measurement vector $Q^*$, the controller generates a control vector
$C(t)$.

$$C(t) := f^{ct}(Q^*, Q(t), S^v(t)) \tag{4.7}$$

The control function $f^{ct}$ describes the relation between changes in
different coordinate systems, e.g., $Q(t)$ in the image and $S^v(t)$ in the
manipulator coordinate system. The control vector $C(t)$ is used to update the
state vector into $S^v(t+1)$, and then a new measurement vector $Q(t+1)$ is
acquired which is supposed to be closer to $Q^*$ than $Q(t)$. In the case that
the desired situation is already reached after the first actuator movement,
the one-step controller can be thought of as an exact inverse model of the
robot system. Unfortunately, in realistic control environments only
approximations for the inverse model are available. Consequently, it is
necessary to run through cycles of gradual actuator movement and continual
visual feedback in order to reach the desired situation step by step, i.e.
with a multi-step controller.
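
The difference between a one-step and a multi-step controller can be made
concrete with a small simulation. It is only a sketch under strong
assumptions: the measurement function is taken to be a fixed linear map that
the controller does not know exactly, the control function uses a rough
approximation of its inverse and ignores the current state vector, and the
state is updated by vector addition. The matrices `TRUE_MAP` and
`APPROX_INVERSE`, the tolerance, and all values are hypothetical.

```python
import math

# Hypothetical true relation between effector state and image measurement.
TRUE_MAP = [[2.0, 0.3], [-0.2, 1.5]]
# Rough approximation of the inverse relation, as available to the controller.
APPROX_INVERSE = [[0.5, 0.0], [0.0, 0.6]]


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def measure(state):
    """Stand-in for the measurement function."""
    return mat_vec(TRUE_MAP, state)


def control(q_desired, q_current):
    """Stand-in for the control function (linear in the measurement error)."""
    error = [d - c for d, c in zip(q_desired, q_current)]
    return mat_vec(APPROX_INVERSE, error)


def servo(state, q_desired, tol=1e-3, max_cycles=50):
    """Cycle through measuring, comparing, and moving until the error is small."""
    errors = []
    for _ in range(max_cycles):
        q = measure(state)
        errors.append(math.dist(q, q_desired))
        if errors[-1] < tol:
            break
        c = control(q_desired, q)
        state = [s + dc for s, dc in zip(state, c)]
    return state, errors


final_state, errors = servo([0.0, 0.0], [4.0, 3.0])
```

With an exact inverse the first movement would already reach the desired
measurement. With the approximate inverse the error shrinks from cycle to
cycle, so several perception-action cycles are needed.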

Offline- and Online-Phase of Image-Based Effector Servoing 

Image-based effector servoing is organized into an offline-phase and an
online-phase. Offline we specify the approximate camera-manipulator relation
of coordinate systems and derive the control function from it. Frequently,
the control function is a linear approximation of the unknown inverse model,
i.e., the parameters $Q^*$, $Q(t)$, $S^v(t)$ are linearly combined to produce
$C(t)$ (*). Online the control function is applied, during which the system
recognizes a current situation and compares it with a certain goal situation.
In case of deviation the effector moves to bring the new situation closer to
the goal situation. This cycle is repeated until a certain threshold criterion
is reached.

(*) Some articles in a book edited by Hashimoto [76] also describe nonlinear,
fuzzy logic, and neural network control schemes.

It is characteristic for image-based effector servoing to work with current
and desired measurements $Q(t)$ and $Q^*$ in the image directly and avoid an
explicit reconstruction into the coordinate system of the effector. Typically,
these image measurements consist of 2D position and appearance features
which suffice in numerous applications and need not be reconstructed. For
example, without reconstructing the 3D object shape it is possible to control
the viewing direction of a camera such that the optical axis is directed to the
center of the object silhouette.

Image-Based Servoing in the Framework of Force Vector Fields

Image-based effector servoing can be regarded conveniently in the framework
of force vector fields. However, the basic vector space does not consist of
effector state vectors $S^v(t)$, but of image measurement vectors $Q(t)$.
Effector servoing is goal-directed, as defined by equations (4.1), (4.6),
(4.7), and can be represented by a fairly simple force vector field. It is
just an attractor vector field, with one attractor specified by the desired
measurement vector $Q^*$, and no repellors are involved. The trajectory from
$Q(t)$ towards $Q^*$ is supposed to be a straight course; however, it will
actually be a jagged course. This is because the current measurement vector
$Q(t)$ can be changed only indirectly by effector movement, and the relevant
control function is just an approximation which is more or less inaccurate.
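
The interplay of offline-phase and online-phase can be sketched as follows.
The sketch is an illustration under assumptions, not the procedure of a
particular system: the measurement function is a hypothetical, mildly
nonlinear map; offline, a linear approximation is estimated by test movements
along each axis (finite differences); online, the inverse of this
approximation serves as control function until a threshold on the measurement
error is met. All function names and values are hypothetical.

```python
import math


def measure(state):
    """Hypothetical, mildly nonlinear stand-in for the measurement function."""
    x, y = state
    return [2.0 * x + 0.2 * math.sin(y), 1.5 * y + 0.2 * math.sin(x)]


def estimate_linear_map(state, delta=0.5):
    """Offline phase: probe each axis with a test movement of size delta."""
    q0 = measure(state)
    columns = []
    for axis in range(2):
        moved = list(state)
        moved[axis] += delta
        q1 = measure(moved)
        columns.append([(b - a) / delta for a, b in zip(q0, q1)])
    # columns[k] is the measurement change per unit movement along axis k.
    return [[columns[0][0], columns[1][0]], [columns[0][1], columns[1][1]]]


def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def servo_online(state, q_desired, inverse, tol=1e-3, max_cycles=50):
    """Online phase: measure, compare with the goal, move, and repeat."""
    measurements = []
    for _ in range(max_cycles):
        q = measure(state)
        measurements.append(q)
        error = [d - c for d, c in zip(q_desired, q)]
        if math.hypot(*error) < tol:
            break
        step = [sum(a * e for a, e in zip(row, error)) for row in inverse]
        state = [s + c for s, c in zip(state, step)]
    return state, measurements


inverse = invert_2x2(estimate_linear_map([0.0, 0.0]))
final_state, measurements = servo_online([0.0, 0.0], [4.0, 3.0], inverse)
```

Because the linear approximation is exact only near the pose at which it was
estimated, the sequence in `measurements` does not reach the desired
measurement in one movement and needs several corrections.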

Compromise between Plan Fulfillment and Plan Adjustment 

In summary, in high-level, deliberate tasks we have to deal with two types
of force vector fields: the first one has been defined in terms of effector
state vectors and the second one in terms of image measurement vectors. This
can also be seen in the two equations (4.5) and (4.7), which specify two
different definitions for control vector $C(t)$. The first one is obtained
from a force vector field which will play the role of a plan for an effector
trajectory. The second one is used for locally changing or refining the
effector trajectory and thus is responsible for deviating from or adjusting
the plan, if necessary according to specific criteria. In other words, the
first category of force vector fields (which we call deliberate vector fields)
is responsible for the deliberate aspect and the second category of force
vector fields (which we call reactive vector fields) for the reactive aspect
of solving high-level robotic tasks.

In the general case, one must find compromise solutions between plan
fulfillment and plan adjustment. However, depending on the characteristics of
the task and of the environment there are also special cases in which either
the reactive or the deliberate aspect is relevant exclusively. That is,
image-based effector servoing must be applied without a deliberate plan, or on
the other hand, a deliberate plan will be executed without frequent visual
feedback. We present examples for the three cases later on in Subsection 4.2.2
and in Section 4.3.

Managing Deliberate and Reactive Vector Fields 

The overall task-solving process of a camera-equipped robot system can be
decomposed into so-called elementary and assembled behaviors. It will prove
convenient to introduce an elementary behavior as the process of image-based
effector servoing by which the current measurement vector is transformed
iteratively into a desired measurement vector. Specifically, the basic
representation scheme of an elementary behavior is a reactive vector field
constructed by just one attractor, and the accompanying equilibrium must be
approached by continual visual feedback control of the effector. The control
vector $C(t)$ is defined according to equation (4.7). In addition to this, we
introduce an elementary instruction as an atomic step of changing the effector
state $S^v(t)$ without visual feedback. The control vector $C(t)$ can be
defined according to equation (4.5). An assembled instruction is a sequence of
elementary instructions strung together, i.e. it is a course of effector
movements as shown, by way of example, in the right image of Figure 4.3.
Elementary and assembled instructions are determined according to a plan which
is represented as a deliberate vector field. Finally, we define an assembled
behavior as the composition of at least one elementary behavior with further
elementary behaviors and/or elementary or assembled instructions. Several
combinations are conceivable depending on requirements of the application. A
behavior (elementary or assembled) does include perception-action cycles, and
an instruction (elementary or assembled) does not.
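
The four notions can be captured in a small data structure. This is a sketch
of one possible encoding, not a prescribed implementation; the class names,
the attribute `uses_visual_feedback`, and the example names are hypothetical.
The only rule enforced is the one stated above: an assembled behavior contains
at least one elementary behavior together with at least one further part.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ElementaryInstruction:
    """Atomic change of the effector state without visual feedback."""
    name: str
    uses_visual_feedback = False  # plain class attribute, not a dataclass field


@dataclass
class AssembledInstruction:
    """Sequence of elementary instructions strung together."""
    steps: List[ElementaryInstruction]
    uses_visual_feedback = False


@dataclass
class ElementaryBehavior:
    """Image-based effector servoing towards one desired measurement vector."""
    name: str
    uses_visual_feedback = True


Part = Union[ElementaryInstruction, AssembledInstruction, ElementaryBehavior]


@dataclass
class AssembledBehavior:
    """At least one elementary behavior combined with further parts."""
    parts: List[Part]
    uses_visual_feedback = True

    def __post_init__(self) -> None:
        behaviors = sum(isinstance(p, ElementaryBehavior) for p in self.parts)
        if behaviors < 1 or len(self.parts) < 2:
            raise ValueError("need one elementary behavior plus a further part")


approach = ElementaryInstruction("move hand above the object")
grasp = ElementaryBehavior("servo to a stable grasping pose")
task = AssembledBehavior([approach, grasp])
```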

Instructions, Behaviors, Sub-tasks, Vector Fields 

The instructions and behaviors of a task-solving process reflect the
decomposition of a high-level task into several sub-tasks. It will prove
convenient to distinguish between elementary sub-tasks and assembled
sub-tasks, with the latter being composed of the former. For solving an
elementary sub-task one works with a partial short-term plan, or without a
plan at all. In the first case, in which a plan is involved, we will stipulate
that in an elementary sub-task the camera-equipped robot system is trying to
reach just one deliberate goal and perhaps must keep certain constraints
(constrained goal achievement). Related to the methodology of dynamical
systems, the overall force vector field for a sub-task must be constructed
from just one attractor vector field, optionally summed up with a collection
of repellor vector fields, i.e. one attractor and optionally several
repellors. The natural restriction to one attractor in a deliberate vector
field reduces the occurrence of ambiguities while planning effector courses
(e.g. only one equilibrium should occur in the whole vector field) (*). The
second case, in which no plan is involved, means that an elementary sub-task
must be solved by an elementary behavior (as introduced above) or any kind of
combination of several elementary behaviors. However, no deliberate support is
provided, i.e. trivially the deliberate vector field is empty without any
virtual force vectors. The actual effector movements are controlled by visual
feedback exclusively, which is represented in reactive vector fields.

(*) In contrast to that, the overall force vector field defined in equation
(4.4) consists of several attractors and repellors including several
equilibrium points.

Generic Deliberate Vector Fields, Current Deliberate Vector Fields

During the task-solving process certain elementary sub-tasks will come to
completion continually and other ones must be treated from the beginning,
i.e. assembled sub-tasks. Related to the planning aspect, a short-term plan
is replaced by another one, and all of them belong to an overall long-term
plan. Related to the methodology of dynamical systems, the deliberate vector
field is non-stationary during the overall process of task-solution. This
means that from elementary sub-task to elementary sub-task both the attractor
and the collection of repellors change in the overall vector field according to
the replacement of the short-term plan. We introduce the following strategy
for handling this non-stationarity. A so-called generic deliberate vector
field is constructed which represents the overall task, and from that we
dynamically construct so-called current deliberate vector fields which are
relevant just at certain intervals of time. Each interval is the time needed
to solve a certain elementary sub-task.

Examples for the Two Categories of Deliberate Vector Fields 

A scene may consist of a set of target objects which should be collected with
a robot arm. A useful strategy is to collect the objects in a succession such
that obstacle avoidance is not necessary. The generic vector field is generated
by constructing attractors from all target objects. The current vector field
should consist of just one attractor and will be derived from the generic field
continually. The relevant succession of current vector fields is obtained by
considering the geometric proximity and relation between the target objects
and the robot arm. Figure 4.4 shows in the left image the process of moving
the effector to the accompanying equilibrium of the first attractor
(constructed at the first target object). Then, the object will be removed,
which also causes an erasure of the attractor, and finally a new attractor is
constructed at another target object. The right image in Figure 4.4 shows
the direct movement to the accompanying equilibrium of the new attractor.

Fig. 4.4. (Left) Direct approach to the accompanying equilibrium of the first
attractor; (Right) Direct approach to the accompanying equilibrium of the
second attractor.
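
The derivation of current vector fields from the generic field can be sketched
for this collecting task. The sketch assumes that the next attractor is simply
the target nearest to the effector, which is only one possible way of
considering geometric proximity, and that the effector moves with steps of
constant length as in equation (4.2). The function name `collect_all`, the
step length, and the target positions are hypothetical.

```python
import math


def collect_all(effector, targets, upsilon_a=0.5):
    """Visit all targets, always using a single current attractor."""
    generic_field = [list(t) for t in targets]  # one attractor per target object
    order = []
    while generic_field:
        # Current deliberate field: only the attractor nearest to the effector.
        current = min(generic_field, key=lambda t: math.dist(effector, t))
        while math.dist(effector, current) > upsilon_a:
            d = math.dist(effector, current)
            effector = [x + upsilon_a * (a - x) / d
                        for x, a in zip(effector, current)]
        order.append(current)
        generic_field.remove(current)  # object removed, attractor erased
    return order, effector


targets = [[5.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
order, final_position = collect_all([0.0, 0.0], targets)
```

In each pass only one attractor is active, so the effector approaches its
equilibrium directly and no repellors are needed.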



Another example is to visit a series of target objects with an eye-on-hand
robot arm and visually inspect the objects in detail according to a certain
succession. As soon as the inspection of an object is finished, i.e. a sub-task
is completed, the attractor constructed at the object is transformed into a
repellor, and additionally an attractor is specified at the next object in the
succession. As a consequence, the robot arm is repelled from the current
object and attracted by the next one (multiple goals achievement).
Figure 4.5 shows in the left image the process of moving the effector to
the accompanying equilibrium of the first attractor (constructed at the first
target object). Then, the object is inspected visually, and after completion
the attractor is replaced by a repellor at that place, and finally a new
attractor is constructed at another target object. The right image in
Figure 4.5 shows the superposition of the attractor and the repellor vector
field, and the course of effector movement which is repelled from the first
object and attracted by the second one.

Fig. 4.5. (Left) Direct approach to the accompanying equilibrium of an
attractor; (Right) Approach to the accompanying equilibrium of another
attractor after being pushed away from the position at which the former
attractor changed to a repellor.
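
The change of an attractor into a repellor can be sketched for two objects.
The sketch assumes the field forms of equations (4.2) and (4.3), vector
addition as state transition, and hypothetical values for the positions, the
field parameters, and the stopping tolerance. The inspection itself is only
indicated by a comment.

```python
import math

UPSILON_A, UPSILON_B, BETA = 0.2, 0.1, 1.0  # hypothetical field parameters


def force(s, attractors, repellors):
    """Superposed force of all attractors and repellors at state s."""
    total = [0.0] * len(s)
    for a in attractors:
        d = math.dist(s, a)
        if d > 0.0:
            total = [t + UPSILON_A * (ai - si) / d for t, si, ai in zip(total, s, a)]
    for r in repellors:
        w = 2.0 * UPSILON_B * math.exp(-(BETA * math.dist(s, r)) ** 2)
        total = [t + w * (si - ri) for t, si, ri in zip(total, s, r)]
    return total


def move_to(s, goal, attractors, repellors, tol=0.2, max_steps=5000):
    """Follow the superposed field until the goal is within tol."""
    for _ in range(max_steps):
        if math.dist(s, goal) < tol:
            break
        s = [si + fi for si, fi in zip(s, force(s, attractors, repellors))]
    return s


first, second = [0.0, 0.0], [6.0, 0.0]
attractors, repellors = [first], []
effector = move_to([0.0, -3.0], first, attractors, repellors)
at_first = list(effector)
# ... the visual inspection of the first object would take place here ...
attractors, repellors = [second], [first]  # attractor replaced by a repellor
effector = move_to(effector, second, attractors, repellors)
```

After the replacement, the first object pushes the effector away while the
second object pulls it in, so both forces support the transfer.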



Dynamical Change of Current Deliberate Vector Fields 

The dynamical construction of current deliberate vector fields from a generic
deliberate vector field takes place during the process of task-solving. This
on-line construction is necessary to keep the process under supervision by
the relevant short-term plan of the overall plan. However, the task-solution
can be reached only by real interaction with the environment, which is done
by visual feedback control as introduced above. First, the switch-off signal
for completing a short-term plan and the trigger signal for starting a new
short-term plan must be extracted from visual feedback of the environment. For
example, in a task of robotic grasping one must recognize a stable grasping
pose from the images; this state completes the elementary sub-task of
approaching the target object, and the next elementary sub-task of closing
the fingers can be executed. Second, in many sub-tasks the short-term plan
should prescribe the course of effector movements only roughly, i.e. play the
role of a supervisor, but the real, fine-tuned effector movements must be
obtained by continual visual feedback. For example, in a task of manipulator
navigation one must avoid obstacles which may be located at places slightly
deviating from the plan. In this case, visual feedback control should be
responsible for keeping the manipulator at certain distances from the
obstacles.

Reactive Vector Fields and Deliberate Vector Fields 

The basic representation scheme for visual feedback control is a reactive vec- 
tor field, and in the specific case of an elementary behavior it consists of just 
one attractor and nothing else (as discussed above). For this case, the current 
measurement vector in the image is transformed into a desired measurement 
vector iteratively. The desired measurement vector is the main feature for 
specifying the reactive vector field. 



The question of interest is, how to combine the deliberate and the 
reactive part of the task-solving process in the methodology of vector 
fields. 



For this purpose we regard the deliberate vector field more generally as a
memory which is organized in two levels. The top level deals with the planning
aspect and contains the centers of attractors, centers of repellors, and force
vectors for reaching the equilibrium points. The bottom level includes the
reactive aspect and contains a series of reactive vector fields which belong to
certain attractors or repellors. For example, in the task of robotic grasping
the attractor at the top level roughly represents the final position of the
robot hand which is supposed to be relevant for grasping, and the reactive
vector field at the bottom level is defined on the basis of a desired
measurement vector which represents the pattern of a stable grasping
situation. Therefore, the gripper can approach the target object based on the
attractor in the top level, and can be fine-tuned on the basis of the relevant
appearance pattern in the bottom level. As another example, in a task of
manipulator navigation a repellor at the top level represents the rough
position of an obstacle, and the desired measurement vector at the bottom
level may represent the critical distance between manipulator and obstacle
which the actual distance should exceed.

Three-Layered Vertical Organization of Force Vector Fields 

We can summarize that a task-solving process is organized vertically by three
layers of force vector fields. These are the generic deliberate field at the
top level, the layer of current deliberate fields at the middle level, and the
current reactive fields at the bottom level. The deliberate field at the top
level represents the overall task, the deliberate fields at the middle level
describe the decomposition into elementary sub-tasks, and the reactive fields
at the bottom level represent the actual task-solving process (*). Previously,
we distinguished between behaviors (elementary or assembled) and instructions,
which can be considered anew in the methodology of vector fields. An elementary
behavior is based on a reactive vector field only, an assembled behavior is
based on reactive vector fields and possibly includes deliberate vector fields,
and an instruction is based on deliberate vector fields only. Generally, the
actual effector movements are determined by considering a bias which can be
represented in a deliberate field, and additionally considering visual feedback
control which is represented in reactive fields.

(*) Müller proposes a three-layered model of autonomous agent systems [109]
which fits the vertical organization of our system. The top, middle, and
bottom levels are called cooperation, planning, and reactive layer,
respectively.

However, for certain sub-tasks and environments it is reasonable to execute
the relevant plan without frequent visual feedback. In these cases, the
actual task-solving process is represented by the deliberate fields only. On
the other hand, for certain sub-tasks visual feedback control must be applied
without a deliberate plan. After completing a sub-task of this kind the
covered course or the final value of the variable state vector of the effector
could be of interest for successive sub-tasks. Therefore, a possible intention
behind a behavior is to extract grounded information by interaction with the
environment. In particular, the grounded information can contribute to the
generation or modification of deliberate fields, i.e. biasing successive
sub-tasks. Related to the organization of vector fields, we can conclude that
in general a bidirectional flow of information will take place, i.e. from top
to bottom and/or in reverse.

Horizontal Organization of Force Vector Fields for Various Effectors

So far, it has been assumed that a high-level, deliberate task is treated by
actively controlling just one effector of a camera-equipped robot system. Only
one generic deliberate field was taken into account along with the offspring
fields at the middle and bottom level. However, in general a high-level robotic
task must be solved by a camera-equipped robot system with more than one
active effector. Several effectors must contribute to solving a high-level,
deliberate task and maybe even to solving a sub-task thereof. Previously, we
stipulated that in each elementary sub-task the camera-equipped robot system
is trying to reach just one deliberate goal and keep certain constraints. The
critical issue concerns the complexity of the goal and/or the constraints,
because, based on this, the number and types of simultaneously active effectors
are determined. For example, a surveillance task may consist of moving a
vehicle to a target position and simultaneously rotating a mounted
head-camera for fixating an object which the vehicle is passing. In this
case, two effectors must work simultaneously in a synchronized mode, i.e. the
vehicle position and the head orientation are interrelated. Apart from the
simultaneous activity of several effectors it is usual that several effectors
come into play one after the other in a sequential mode.
Splitting the Variable State Vector of Effectors

Regardless of the simultaneous or the sequential mode, for each effector at
least one generic deliberate field is required along with the offspring
fields. However, for certain effectors it makes sense to split up the variable
state vector $S^v(t)$ into sub-vectors $S^{v_1}(t), \ldots, S^{v_n}(t)$.
Corresponding to this, the control vector $C(t)$ must also be split up into
sub-vectors $C^{v_1}(t), \ldots, C^{v_n}(t)$. As a first example, the vector
of pose parameters of a robot hand can be split up into the sub-vector of
position parameters (i.e. 3D coordinates X, Y, Z) and the sub-vector of
orientation parameters (i.e. Euler angles yaw, pitch, roll). As a second
example, the view parameters of stereo head-cameras can be treated as
single-parameter sub-vectors which consist of the pan, the tilt, and the
vergence angles. The splitting up of the variable state vector depends on
the type of the effector and the type of the task. Furthermore, in certain
tasks it is reasonable to select a certain sub-vector of parameters as being
variable, and keep the other sub-vector of parameters constant. In conclusion,
in the designing phase we must specify for each effector of the
camera-equipped robot system a set of sub-vectors of parameters which will be
potentially controlled for the purpose of solving the underlying sub-task. For
each effector this set can be empty (i.e. the effector is kept stable), or the
set may contain one sub-vector (i.e. the effector can perform just one
category of movements), or the set may contain several sub-vectors (i.e. the
effector can perform several categories of movements).
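
The splitting of a state vector into sub-vectors can be sketched for the first
example, the pose of a robot hand. The layout of the six pose parameters, the
names `position` and `orientation`, and the function names are hypothetical.
Control sub-vectors are added to the matching state sub-vectors, and every
sub-vector without a control sub-vector stays constant.

```python
from typing import Dict, List, Sequence

# Hypothetical layout of a six-parameter hand pose.
SUB_VECTORS: Dict[str, slice] = {
    "position": slice(0, 3),     # X, Y, Z
    "orientation": slice(3, 6),  # yaw, pitch, roll
}


def split(vector: Sequence[float]) -> Dict[str, List[float]]:
    """Split a full state or control vector into its named sub-vectors."""
    return {name: list(vector[idx]) for name, idx in SUB_VECTORS.items()}


def apply_sub_controls(state: Sequence[float],
                       controls: Dict[str, Sequence[float]]) -> List[float]:
    """Add each control sub-vector to its state sub-vector; keep the rest."""
    new_state = list(state)
    for name, control in controls.items():
        idx = SUB_VECTORS[name]
        if len(control) != idx.stop - idx.start:
            raise ValueError(f"control sub-vector '{name}' has wrong dimension")
        new_state[idx] = [s + c for s, c in zip(state[idx], control)]
    return new_state


pose = [1.0, 2.0, 3.0, 0.1, 0.2, 0.3]
# Only the position sub-vector is selected as variable in this sub-task.
moved = apply_sub_controls(pose, {"position": [0.5, 0.0, -1.0]})
```

An empty dictionary of controls corresponds to an effector which is kept
stable, and a dictionary with both entries to an effector which performs
several categories of movements.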

Vertical and Horizontal Organization of Force Vector Fields 

Generally, for each effector a set of generic deliberate fields is required
together with the offspring fields, which depends on the complexity of the
high-level task. The number of generic deliberate fields corresponds to the
number of sub-vectors of variable parameters. As a consequence, apart from
the vertical organization (discussed previously) the task-solving process must
also be organized horizontally, including more than one generic deliberate
field. Figure 4.6 shows figuratively the vertical and horizontal organization
of deliberate and reactive fields which may be involved in a task-solving
process for a high-level, deliberate task. In the case that image-based
effector servoing is applied without a deliberate plan, the deliberate fields
are trivial. In the case that a deliberate plan is executed without frequent
visual feedback, the reactive fields are trivial. Information is exchanged
vertically and horizontally, which is indicated in the figure by bidirectional
arrows.

Fig. 4.6. Vertical and horizontal organization of deliberate and reactive
fields involved in a task-solving process.

Monitoring the Task-Solving Process

In the designing phase of a task-solving process the high-level, deliberate
task must be decomposed into sub-tasks, and for these one must regulate
the way of cooperation. We distinguish two modes of cooperation, i.e. the
sub-tasks are treated sequentially or simultaneously with other sub-tasks. In
the sequential mode several goals must be reached, but one after the other. The
completion of a certain sub-task contributes current information or arranges
an environmental situation which is needed for successive sub-tasks. Related
to the methodology of dynamical systems, these contributions are reflected
by changing certain vector fields. For example, after grasping, picking up,
and removing a target object we have to update the relevant deliberate
vector field, i.e. remove the attractor which has been constructed at the
object. In the simultaneous mode of cooperating sub-tasks, several goals are
pursued in parallel. Several sub-tasks may be independent and can be
treated by different effectors at the same time for the simple reason of saving
overall execution time. On the other hand, it may be mandatory that several
sub-tasks cooperate in a synchronized mode for solving a high-level task.

Three Categories of Monitors 

A so-called monitor is responsible for supervising the task-solving process
including the cooperation of sub-tasks (*). More concretely, the monitor must
take care of three aspects. First, for each sub-task there is a limited period
of time, and the monitor must watch a clock and wait for the signal which
indicates the finishing of the sub-task. If this signal arrives in time, then
the old sub-task is switched off and the successive one is switched on.
However, if this signal arrives late or not at all, then exception handling is
needed (e.g. stopping the overall process). Second, each sub-task should
contribute up-to-date information or arrange an environmental situation, and
after completion of the sub-task the monitor must check whether the relevant
contribution is supplied. For example, up-to-date information may be collected
in deliberate fields, and an analysis of the contents is required to decide
whether the successive sub-task can be started. Third, any technical system
which is embedded in a real environment must deal with unexpected events. For
example, a human may enter the working area of certain effectors without
permission. The monitor must detect these events and react appropriately, e.g.
continue after a waiting phase or stop the process for good. The monitors
are implemented in the designing phase and are specific for each high-level
task.

(*) Kruse and Wahl [91] presented a camera-based monitoring system for mobile
robot guidance. We will use the term monitor more generally, e.g. including
tasks of keeping time limitations, confirming intermediate situations, and
treating exceptional situations.

Cooperation of Sub-tasks 

One must determine which sub-tasks should work sequentially or simultaneously 
and which period of time is considered acceptable. Furthermore, for each 
sub-task the constituents of the environment which are normally involved must 
be determined. A constituent of the environment can be the actuator system 
(e.g. diverse effectors), the cameras (mobile or stable), or several 
task-relevant objects (e.g. stable platform), etc. This fundamental 
information (concerning periods of time or constituents of the environment) 
is needed for checking relevant contributions or detecting events during the 
task-solving process. Figure 4.7 shows for a task-solving process the 
cooperative arrangement of sub-tasks, i.e. sequentially or simultaneously, 
together with the environmental constituents involved in each sub-task 
(indicated by a dot). It is just a generic scheme which must be made concrete 
for each specific task during the designing phase. 

4.2.2 Generic Modules for System Development 

To support the designing phase of autonomous camera-equipped robot systems we 
identified 12 generic modules. They are generic in the sense that 
task-specific modules will make use of them with specific parametrizations or 
specific implementations.^10 Each generic module is responsible for solving an 
elementary sub-task. Following the explanations in the previous subsection we 
distinguish three instructional modules, six behavioral modules, and three 
monitoring modules. The instructional modules are based on deliberate vector 
fields; the behavioral modules generate reactive vector fields and possibly 
are based on deliberate vector fields. The monitoring modules play an 
exceptional role, i.e. they neither generate nor are based on vector fields. 
In Section 4.3 we will show the usage of a subset of 9 generic modules (three 
instructional, three behavioral, and three monitoring modules) for an 
exemplary task. Therefore, this subsection presents just these relevant 
modules, and the remaining subset of 3 behavioral modules is presented in 
Appendix 2. 

^10 The generic modules serve as design abstractions in order to simplify 
system development for a specific robotic task. This methodology is similar to 
the use of general design patterns for the development of object-oriented 
software products [61]. However, we propose application-specific design 
patterns for the development of autonomous, camera-equipped robot systems. 

Fig. 4.7. Sequential and simultaneous cooperation of sub-tasks and involved 
environmental constituents. 

Generic Module MI1 for Assembled Instruction 

The module involves the human designer of the autonomous system. With 
the use of the control panel a certain effector can be steered step by step into 
several states. The relevant succession of elementary instructions is obtained 
from a deliberate field which may contain a trajectory. Based on this teach-in 
approach, the designer associates proprioceptive states with certain extero- 
ceptive features, e.g. determines the position of a static environmental object 
in the base coordinate system of the robot arm. The list of taught states is 
memorized. No explicit deliberate field is involved with this module, except 
the one in the mind of the human operator. 
Module MI1 

1. Determine relevant type of variable state vector $\mathcal{S}^v$. 

2. Manually generate a state vector $S^v(t)$ of the relevant effector. 

3. Memorize pair $(t, S^v(t))$ of demonstration index and new state vector, 
and increment demonstration index $t := t + 1$. 

4. If (demonstration_index < required_number_of_demonstrations) 
then go to 2. 

5. Stop. 



Generic Module MI2 for Assembled Instruction 

The module represents a steering mechanism which changes the variable state 
of an effector step by step according to a pre-specified trajectory. The 
relevant succession of elementary instructions is obtained from a deliberate 
field which may contain a trajectory. The purpose is to bring the effector 
into a certain state from which to start a successive sub-task. The concrete 
course has been determined in the designing phase. No visual feedback control 
is involved; however, the movement is organized incrementally such that a 
monitor module can interrupt in exceptional cases (see monitor MM3 below). 



Module MI2 

1. Determine relevant type of variable state vectors $\mathcal{S}^v$. 
Take deliberate field of $\mathcal{S}^v$ into account. 

2. Determine current state vector $S^v(t)$ of the relevant effector. 

3. Determine control vector according to equation 
$C(t) := VF_O[\text{trajectory for } S^v](S^v(t))$. 

4. If ( $\| C(t) \| < \eta_1$ ) then go to 7. 

5. Change variable state vector according to equation 
$S^v(t+1) := f^{ts}(C(t), S^v(t))$, 
and increment time parameter $t := t + 1$. 

6. Go to 2. 

7. Memorize final state vector and stop. 
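
The control cycle of MI2 can be sketched in Python as follows. This is only a 
minimal illustration under stated assumptions, not the implementation used for 
the robot system: the deliberate field is modelled as a list of waypoints, the 
transition function simply adds the control vector to the state, and the names 
TrajectoryField, run_mi2, step, and interrupt are hypothetical. 

```python
import math


class TrajectoryField:
    """Hypothetical deliberate field: steers toward the next unvisited waypoint."""

    def __init__(self, waypoints, step=1.0):
        self.waypoints = [tuple(w) for w in waypoints]
        self.step = step          # assumed maximum length of one control vector
        self.next_index = 0

    def __call__(self, state):
        while self.next_index < len(self.waypoints):
            target = self.waypoints[self.next_index]
            delta = [g - s for g, s in zip(target, state)]
            dist = math.hypot(*delta)
            if dist > 1e-9:
                scale = min(self.step, dist) / dist
                return [d * scale for d in delta]
            self.next_index += 1   # waypoint reached, continue with the next one
        return [0.0] * len(state)  # end of trajectory: zero control vector


def run_mi2(field, state, eta1=1e-6, interrupt=None, max_steps=10_000):
    """Follow the field until the control vector is shorter than eta1."""
    state = list(state)
    for t in range(max_steps):
        control = field(state)                            # step 3
        if math.hypot(*control) < eta1:                   # step 4
            return state, t                               # step 7
        if interrupt is not None and interrupt(state):    # hook for a monitor
            raise RuntimeError("interrupted by monitor")
        state = [s + c for s, c in zip(state, control)]   # step 5 (assumed f^ts)
    raise RuntimeError("trajectory not finished within max_steps")
```

A call such as run_mi2(TrajectoryField([(3, 0), (3, 4)]), (0, 0)) returns the 
final state together with the number of executed steps. The interrupt hook is 
one possible way to let an exception monitor stop the incremental movement. 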
Generic Module MI3 for Assembled Instruction 

The module is similar to the previous one in that the relevant succession 
of elementary instructions is obtained from a deliberate field. Additionally, 
image measurements are taken during the step-wise change of the effector 
state. The measurements are memorized together with the time indices, and 
the states of the effector. The purpose is to collect image data, e.g. for learning 
coordinations between effector state and image measurements, for learning 
operators for object recognition, or for inspecting large scenes with cameras 
of small fields of view. 



Module MI3 

1. Determine relevant type of variable state vectors $\mathcal{S}^v$ and 
accompanying type of measurements $\mathcal{Q}$. 
Take deliberate field of $\mathcal{S}^v$ into account. 

2. Determine current state vector $S^v(t)$ of the relevant effector. 

3. Determine control vector according to equation 
$C(t) := VF_O[\text{trajectory for } S^v](S^v(t))$. 

4. If ( $\| C(t) \| < \eta_1$ ) then go to 9. 

5. Change variable state vector according to equation 
$S^v(t+1) := f^{ts}(C(t), S^v(t))$, 
and increment time parameter $t := t + 1$. 

6. Determine new measurement vector $Q(t)$. 

7. Memorize triple $(t, S^v(t), Q(t))$ of time index, new state vector, 
and new measurement vector. 

8. Go to 2. 

9. Stop. 
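
The data collection of MI3 can be sketched in the same way. The following 
Python fragment is an assumption-laden illustration: the deliberate field is 
reduced to a single attractor, the measurement is an arbitrary function of the 
state, and the names attractor_field, run_mi3, and measure are hypothetical. 

```python
import math


def attractor_field(target, step=1.0):
    """Hypothetical deliberate field with one attractor at `target`."""
    def field(state):
        delta = [g - s for g, s in zip(target, state)]
        dist = math.hypot(*delta)
        if dist == 0.0:
            return [0.0] * len(delta)
        scale = min(step, dist) / dist
        return [d * scale for d in delta]
    return field


def run_mi3(field, measure, state, eta1=1e-6, max_steps=10_000):
    """Move along the field and memorize (t, state, measurement) triples."""
    state = list(state)
    memory = []
    for t in range(1, max_steps + 1):
        control = field(state)                               # step 3
        if math.hypot(*control) < eta1:                      # step 4
            return memory                                    # step 9
        state = [s + c for s, c in zip(state, control)]      # step 5 (assumed f^ts)
        memory.append((t, tuple(state), measure(state)))     # steps 6 and 7
    raise RuntimeError("field not finished within max_steps")
```

The returned list plays the role of the memorized triples; in the application 
the measurement function would take and process a camera image. 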



In the following we present three behavioral modules which will be used 
in the next section. Further behavioral modules are given in Appendix 2. 

Generic Module MB1 for Elementary Behavior 

The module represents a visual feedback control algorithm for the variable 
state vector of a robot effector. The current measurements in images should 
change step by step into desired measurements. The control function is based 
on approximations of the relationship between changes of the effector state 
and changes of measurements in the images, e.g. linear approximations with 
Jacobian matrices. Apart from the mentioned relationship no other bias is 
included, e.g. no plan in form of a deliberate field. The generated effector 
trajectory is a pure, reactive vector field. 
Module MB1 

1. Take desired measurement vector $Q^*$ into account. 

2. Determine current measurement vector $Q(t)$. 

3. Determine current state vector $S^v(t)$ of the relevant effector. 

4. Determine control vector according to equation 
$C(t) := f^{ct}(Q^*, Q(t), S^v(t))$. 

5. If ( $\| C(t) \| < \eta_2$ ) then go to 8. 

6. Change variable state vector according to equation 
$S^v(t+1) := f^{ts}(C(t), S^v(t))$, 
and increment time parameter $t := t + 1$. 

7. Go to 1. 

8. Return final state vector $S^v(t)$, and stop. 
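
As an illustration of this feedback cycle, the following Python sketch uses a 
proportional control function based on the inverse of a constant 2x2 Jacobian. 
The concrete control law, the gain, and the names invert_2x2, mat_vec, and 
run_mb1 are assumptions made for the example; the text above only states that 
linear approximations with Jacobian matrices may be used. 

```python
import math


def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Jacobian is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def run_mb1(measure, jacobian, q_desired, state,
            gain=0.5, eta2=1e-6, max_steps=1000):
    """Servo the state until the control vector is shorter than eta2."""
    j_inv = invert_2x2(jacobian)
    state = list(state)
    for _ in range(max_steps):
        q = measure(state)                                       # step 2
        error = [qd - qi for qd, qi in zip(q_desired, q)]
        control = [gain * c for c in mat_vec(j_inv, error)]      # step 4 (assumed f^ct)
        if math.hypot(*control) < eta2:                          # step 5
            return state                                         # step 8
        state = [s + c for s, c in zip(state, control)]          # step 6 (assumed f^ts)
    raise RuntimeError("servoing did not converge within max_steps")
```

With a gain below one the effector approaches the goal in shrinking 
increments, which keeps each step small enough for a monitor to intervene. 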



Generic Module MB2 for Assembled Behavior 

The module is responsible for an assembled behavior which integrates two 
elementary behaviors, i.e. executing two goal-oriented cycles. While trying 
to keep desired image measurements of a first type, the robot effector keeps 
on changing its variable state to reach desired image measurements of a sec- 
ond type. The inner cycle takes into account the first type of measurements 
and gradually changes the effector state such that the relevant desired 
measurement is reached. Then, in the outer cycle one changes the effector 
state to a certain extent such that the image measurement of the second type 
will come closer to the relevant desired measurement. This procedure is repeated 
until the desired measurement of the second type is reached. No plan in form 
of a deliberate field is used, and the generated effector trajectory is a pure, 
reactive vector field. However, the resulting course of the effector state is 
memorized, i.e. will be represented in deliberate fields. 
Module MB2 

1. Determine relevant type of variable state vectors $\mathcal{S}^{v1}$ and 
accompanying type of measurements $\mathcal{Q}^1$. 
Initialization of a deliberate field for $\mathcal{S}^{v1}$. 
Determine relevant type of variable state vectors $\mathcal{S}^{v2}$ and 
accompanying type of measurements $\mathcal{Q}^2$. 
Initialization of a deliberate field for $\mathcal{S}^{v2}$. 

2. Behavioral module MB1: 
Configure with ( $\mathcal{S}^{v1}$, $\mathcal{Q}^1$ ), execution, and 
return ( $S^{v1}(t)$ ). 

3. Construct an equilibrium in the deliberate field $\mathcal{S}^{v1}$ based 
on current state vector $S^{v1}(t)$. 

4. Take desired measurement vector $(Q^2)^*$ into account. 

5. Determine current measurement vector $Q^2(t)$. 

6. Determine current state vector $S^{v2}(t)$ of the relevant effector. 

7. Determine control vector according to equation 
$C^{v2}(t) := f^{ct}((Q^2)^*, Q^2(t), S^{v2}(t))$. 

8. If ( $\| C^{v2}(t) \| < \eta_2$ ) then go to 12. 

9. Change variable state vector according to equation 
$S^{v2}(t+1) := f^{ts}(C^{v2}(t), S^{v2}(t))$, 
and increment time parameter $t := t + 1$. 

10. Construct an equilibrium in the deliberate field $\mathcal{S}^{v2}$ based 
on new state vector $S^{v2}(t)$. 

11. Go to 2. 

12. Memorize final deliberate fields $\mathcal{S}^{v1}$ and 
$\mathcal{S}^{v2}$, and stop. 
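
The interplay of the inner and the outer cycle can be sketched in Python for 
the simplest possible case of two scalar state components. Everything concrete 
here is an assumption made for illustration: the proportional control laws, 
the unit sensitivity of the first measurement to the first state component, 
the bounded outer step, and the names servo_inner and run_mb2. 

```python
def servo_inner(q1, s1, s2, q1_desired=0.0, gain=0.5, eta=1e-6, max_steps=1000):
    """Inner cycle (an MB1 instance): servo s1 until q1 reaches q1_desired."""
    for _ in range(max_steps):
        control = gain * (q1_desired - q1(s1, s2))
        if abs(control) < eta:
            return s1
        s1 += control
    raise RuntimeError("inner cycle did not converge")


def run_mb2(q1, q2, s1, s2, q2_desired,
            max_outer_step=1.0, eta2=1e-6, max_steps=1000):
    """Outer cycle: move s2 toward q2_desired while the inner cycle keeps q1."""
    course = []                                   # memorized course of the state
    for _ in range(max_steps):
        s1 = servo_inner(q1, s1, s2)              # step 2
        course.append((s1, s2))                   # steps 3 and 10 (simplified)
        error = q2_desired - q2(s1, s2)           # steps 4 to 6
        control = max(-max_outer_step, min(max_outer_step, error))  # step 7
        if abs(control) < eta2:                   # step 8
            return course                         # step 12
        s2 += control                             # step 9
    raise RuntimeError("outer cycle did not converge")
```

The memorized course corresponds to the equilibria constructed in the 
deliberate fields; here it is simply a list of state pairs. 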



Generic Module MB3 for Assembled Behavior 

The module represents a steering mechanism which changes the variable state 
of an effector step by step according to a pre-specified course. The relevant 
succession of elementary instructions is obtained from a deliberate field which 
may contain a trajectory. In contrast to the instructional module MI2, 
certain measurements are taken from the image continually, and if certain 
conditions hold, then the plan execution is interrupted. At the bottom level 
(of our vertical organization) no reactive vectors are generated, because visual 
feedback only serves for recognizing the stopping condition. 
Module MB3 

1. Determine relevant type of variable state vectors $\mathcal{S}^v$. 
Take deliberate field of $\mathcal{S}^v$ into account. 
Determine relevant type of measurements $\mathcal{Q}$. 

2. Take desired measurement vector $Q^*$ into account. 

3. Determine current measurement vector $Q(t)$. 

4. If ( $\| Q^* - Q(t) \| < \eta_2$ ) then go to 10. 

5. Determine current state vector $S^v(t)$ of the relevant effector. 

6. Determine control vector according to equation 
$C(t) := VF_O[\text{trajectory for } S^v](S^v(t))$. 

7. If ( $\| C(t) \| < \eta_1$ ) then go to 10. 

8. Change variable state vector according to equation 
$S^v(t+1) := f^{ts}(C(t), S^v(t))$, 
and increment time parameter $t := t + 1$. 

9. Go to 2. 

10. Memorize final state vector and stop. 
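
The two stopping conditions of MB3 can be made explicit in a short Python 
sketch. It assumes an additive transition function and a Euclidean norm, and 
the names run_mb3, field, and measure are hypothetical; the returned reason 
string is added only for illustration. 

```python
import math


def run_mb3(field, measure, q_desired, state,
            eta1=1e-6, eta2=1e-6, max_steps=10_000):
    """Execute a plan until the image condition holds or the plan ends."""
    state = list(state)
    for _ in range(max_steps):
        q = measure(state)                                   # step 3
        if math.dist(q_desired, q) < eta2:                   # step 4
            return state, "measurement_reached"              # step 10
        control = field(state)                               # step 6
        if math.hypot(*control) < eta1:                      # step 7
            return state, "plan_finished"                    # step 10
        state = [s + c for s, c in zip(state, control)]      # step 8 (assumed f^ts)
    raise RuntimeError("neither stopping condition met within max_steps")
```

Note that the visual measurement never enters the control vector; it only 
decides whether the plan execution is interrupted. 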



The following three monitor modules are responsible for the surveillance 
of the task-solving process. Three important aspects have been mentioned 
above and three generic modules are presented accordingly. 

Generic Module MM1 for Time Monitor 

The time monitor MM1 checks whether the sub-tasks are solved in time 
(see above for a detailed description). 

Module MM1 

1. Take period of time for sub-task i into account. 

2. Determine working state of sub-task i. 

3. If (sub-task_i_is_no_more_working) then go to 7. 

4. Determine current time. 

5. If (sub-task_i_is_still_working_and_current_time_within_period) 
then wait a little bit, then go to 2. 

6. If (sub-task_i_is_still_working_and_current_time_not_within_period) 
then emergency exit. 

7. Stop. 
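
A polling loop of this kind might look as follows in Python. The sketch 
assumes that the working state of the sub-task can be queried through a 
callable; the exception class TimeLimitExceeded and the parameters poll_s, 
clock, and sleep are hypothetical names introduced for the example. 

```python
import time


class TimeLimitExceeded(Exception):
    """Stands for the 'emergency exit' of step 6."""


def run_mm1(is_working, period_s, poll_s=0.01,
            clock=time.monotonic, sleep=time.sleep):
    """Wait for a sub-task to finish, but no longer than period_s seconds."""
    start = clock()
    while is_working():                          # steps 2 and 3
        if clock() - start > period_s:           # steps 4 and 6
            raise TimeLimitExceeded("sub-task exceeded its period of time")
        sleep(poll_s)                            # step 5: wait a little bit
    return clock() - start                       # step 7
```

Passing the clock and the sleep function as parameters is a design choice 
that allows the monitor to be tested without real waiting. 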



Generic Module MM2 for Situation Monitor 

The situation monitor MM2 checks, after completion of a certain sub-task, 
whether an environmental situation has been arranged which is needed 
in successive sub-tasks (see above for a detailed description). 

Module MM2 

1. Determine working state of sub-task i. 

2. If (sub-task_i_is_still_working) then wait a little bit, then go to 1. 

3. Take goal situation of sub-task i into account. 

4. Take actual situation after completing sub-task i into account. 

5. Determine a distance measure between goal and actual situation. 

6. If (distance_measure_is_beyond_a_threshold) then emergency exit. 

7. Stop. 
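
The comparison of goal and actual situation can be sketched as follows. The 
Euclidean distance is merely one possible distance measure and is assumed here 
for illustration; SituationMismatch, run_mm2, and observe are hypothetical 
names. 

```python
import math
import time


class SituationMismatch(Exception):
    """Stands for the 'emergency exit' of step 6."""


def run_mm2(is_working, goal, observe, threshold,
            poll_s=0.01, sleep=time.sleep):
    """Check the situation reached by a sub-task against its goal situation."""
    while is_working():                          # steps 1 and 2
        sleep(poll_s)
    actual = observe()                           # step 4
    distance = math.dist(goal, actual)           # step 5 (assumed Euclidean)
    if distance > threshold:                     # step 6
        raise SituationMismatch(f"distance {distance:.3f} exceeds threshold")
    return distance                              # step 7
```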



Generic Module MM3 for Exception Monitor 

The exception monitor MM3 observes the overall task-solving process with 
the purpose of reacting appropriately in case of unexpected events (see above 
for a detailed description). 

Module MM3 

1. Take overall period of time for the high-level task into account. 

2. Throughout the period of time, for all sub-tasks i: 

2.1. Determine working state of sub-task i. 

2.2. Determine working space of sub-task i. 

2.3. If (sub-task_i_is_working_and_unexpected_event_in_relevant_working_space) 
then emergency exit. 

3. Stop. 
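
One pass of step 2 can be sketched in Python under simplifying assumptions: 
working spaces are modelled as axis-aligned rectangles in the ground plane, 
and unexpected events are given as already detected positions. The names 
SubTask, UnexpectedEvent, in_workspace, and check_mm3 are hypothetical. 

```python
from dataclasses import dataclass


class UnexpectedEvent(Exception):
    """Stands for the 'emergency exit' of step 2.3."""


@dataclass
class SubTask:
    name: str
    working: bool
    workspace: tuple  # assumed rectangle (x_min, y_min, x_max, y_max)


def in_workspace(workspace, point):
    """Return True if the point lies inside the rectangular working space."""
    x_min, y_min, x_max, y_max = workspace
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def check_mm3(sub_tasks, event_positions):
    """One pass over all sub-tasks; raises if an active workspace is affected."""
    for task in sub_tasks:
        if not task.working:                                  # step 2.1
            continue
        for position in event_positions:
            if in_workspace(task.workspace, position):        # steps 2.2 and 2.3
                raise UnexpectedEvent(
                    f"event at {position} in working space of {task.name}")
```

In a running system this check would be repeated throughout the overall 
period of time, with the event positions supplied by image analysis. 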



Scheme of Task-Specific Modules MTi 

Specific implementations of the basic, generic modules will be used in task- 
specific modules which are responsible for solving certain tasks and sub-tasks. 
For simplifying the designing phase we will configure task-specific modules at 
several levels of an abstraction hierarchy, such that in general higher-level 
modules are based on combinations of lower-level modules. The bottom level 
of the abstraction hierarchy contains specific implementations of the generic 
modules introduced above. 

Generally, the scheme of a task-specific module consists of three components, 
i.e. a combination of lower-level modules, functions for pre-processing and 
post-processing relevant data, and an input/output mechanism for reading from 
and writing to the memory. Concerning the combination of lower-level modules 
we distinguish between sequential execution (denoted by & ) and parallel 
execution (denoted by | ). The parallel combination is asynchronous in the 
sense that the control cycles of two participating lower-level modules are not 
synchronized with each other. Instead, synchronization takes place by 
simultaneously starting some lower-level modules, waiting until all of them 
have finished, and then starting the next module(s) in the succession. 
However, if synchronization is needed at a more fine-grained level of control, 
then it must be implemented as a basic module (e.g. see above the assembled 
behavior MB2). As a result of the lower-level modules one obtains topical 
image information about the scene, and/or topical states of the effectors. 
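
The two kinds of combination can be illustrated with a small Python sketch. 
It assumes that each module is a callable which receives the shared memory, 
and it uses threads for the parallel case; the function names seq and par are 
hypothetical stand-ins for the operators & and |. 

```python
from concurrent.futures import ThreadPoolExecutor


def seq(*modules):
    """Sequential combination (&): run the modules one after another."""
    def run(memory):
        for module in modules:
            module(memory)
    return run


def par(*modules):
    """Parallel combination (|): start all modules, wait until all finished."""
    def run(memory):
        if not modules:
            return
        with ThreadPoolExecutor(max_workers=len(modules)) as pool:
            futures = [pool.submit(module, memory) for module in modules]
            for future in futures:
                future.result()   # blocks until done and re-raises errors
    return run
```

For example, seq(a, par(b, c), d) starts b and c together after a has 
finished and starts d only when both have finished. Within par the control 
cycles of b and c are not synchronized, which matches the asynchronous 
combination described above. 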

Before lower-level modules are applied, certain functions for extracting 
relevant image features may have to be applied in advance. To parameterize 
these functions specifically one may obtain data from the local input of the 
current task-specific module, i.e. the parameter list of the task-specific 
module, or from global input, i.e. the shared memory of the task-solving 
process. Alternatively, a function may implement an iterative, feedback-based 
approach of feature extraction in which the parameters are tuned 
autonomously.^11 The results of the pre-processing functions are forwarded to 
the lower-level modules, and the results of these modules are forwarded to the 
output (possibly after applying post-processing functions). Local input or 
output contains intermediate data, and global input or output contains final 
data which are represented in the shared memory for solving the high-level 
task. 

In summary, the generic scheme of task-specific modules contains the fol- 
lowing entries. 



Task-specific module MTi 



1. Name 

2. Local input 

3. Global input 

4. Pre-processing functions 

5. Combination of lower-level modules 

6. Post-processing functions 

7. Local output 

8. Global output 
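
The eight entries of the scheme could be mirrored by a small data structure. 
The following Python sketch is only one possible reading: data are passed as 
dictionaries, the combination of lower-level modules is a single callable, and 
the class name TaskSpecificModule and its attribute names are hypothetical. 

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Data = Dict[str, Any]


@dataclass
class TaskSpecificModule:
    name: str                                                   # entry 1
    combination: Callable[[Data], Data]                         # entry 5
    pre_processing: List[Callable[[Data], Data]] = field(default_factory=list)   # entry 4
    post_processing: List[Callable[[Data], Data]] = field(default_factory=list)  # entry 6
    global_output_keys: List[str] = field(default_factory=list)                  # entry 8

    def run(self, local_input: Data, shared_memory: Data) -> Data:
        """Apply pre-processing, lower-level modules, and post-processing."""
        data = {**shared_memory, **local_input}                 # entries 2 and 3
        for function in self.pre_processing:
            data = function(data)
        data = self.combination(data)
        for function in self.post_processing:
            data = function(data)
        for key in self.global_output_keys:
            shared_memory[key] = data[key]                      # global output
        return data                                             # local output
```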



In the next Section 4.3 we present some examples of task-specific modules. 
However, only the entry combination of lower-level modules will be specified 
in detail and the others will remain vague. 



^11 Feedback-based autonomous image analysis is one of the characteristics of 
Robot Vision (see Subsection 1.2.2). 
4.3 Treatment of an Exemplary High-Level Task 

In Section 4.1 we explained that a bottom-up designing methodology is 
essential for obtaining an autonomous camera-equipped robot system. The 
application phase of a task-solving process must be preceded by an 
experimentation phase whose outcome is supposed to be a configuration of 
modules for solving the underlying task. In the preceding Section 4.2 we 
presented basic mechanisms and categories of generic modules which must be 
implemented specifically. This section describes an exemplary high-level task, 
applies the bottom-up designing methodology, and presents specific 
implementations of the generic modules. 



4.3.1 Description of an Exemplary High-Level Task 

The high-level deliberate task is to find a target object among a set of objects 
and carry it to another place for the purpose of detailed inspection. Figure 4.8 
shows the original scene including the robot system and other task-relevant 
environmental constituents. 




Fig. 4.8. The image shows the scene including robot arm, robot vehicle with 
binocular head, rotary table, and the task-relevant domestic area, inspection area, 
and parking area. 
The computer system consists of a Sun Enterprise (E4000 with 4 UltraSparc 
processors) for image processing and of special-purpose processors for 
computing the inverse kinematics and motor signals. The actuator system of the 
robot system is composed of four subsystems. The first subsystem is a robot 
arm (Stäubli-Unimation RX-90) fastened on the ground plane. Based on six 
rotational joints one can move the robot hand into an arbitrary position and 
orientation within a certain working space. Additionally, there is a linear 
joint at the robot hand for opening/closing parallel jaw fingers. The position 
of the robot hand is defined by the tool center point, which is fixed at the 
midpoint between the two finger tips (convenient for our application). The 
second subsystem is a robot vehicle (TRC Labmate) which can move arbitrarily 
in the neighborhood of the robot arm, e.g. turn round and translate in any 
direction. The position of the robot vehicle is defined as the center of the 
platform. The third subsystem is a robot head including a stereo camera (TRC 
bisight) which is fastened on the robot vehicle. The robot head is equipped 
with pan, tilt, and vergence degrees of freedom (DOF), and zooming/focusing 
facilities. By moving the vehicle and changing the view direction of the 
cameras one can observe the robot arm and its working space from different 
viewpoints. The fourth subsystem is a rotary table on which to place and 
rotate objects in order to inspect them from any view. 

Further constituents of the scene are three particular ground planes which 
must be located within the working space of the robot hand (of the robot 
arm). These task-relevant planes are the so-called domestic area, the 
inspection area, and the parking area. We assume that several objects can be 
found on the domestic area. A specific target object is of further interest 
and should be inspected in detail, which is done at the inspection area. For 
this purpose, the target object must be localized on the domestic area, 
carried away, and placed on the inspection area. However, it may happen that 
the target object cannot be approached for robotic grasping due to obstacle 
objects. In this case, the obstacle objects must be localized, moved to the 
parking area, and placed there temporarily. 

Five Cameras for the Robot System 

The autonomy in solving this task is supposed to be reached with five cameras. 
First, for the purpose of surveillance of the task-solving process one camera 
(so-called ceiling-camera CA1) is fastened at the ceiling. The optical axis is 
oriented toward the center of the working area, and the objective is of medium 
focal length (e.g. 6mm) such that the whole scene is contained in the field of 
view. Second, two cameras are fastened at the robot hand; the first one 
(so-called hand-camera CA2) is used for localizing the target object and the 
second one (so-called hand-camera CA3) is used for controlling the grasping 
process. For hand-camera CA2 the viewing direction is approximately parallel 
to the fingers and the objective is of small focal length (e.g. 4.2mm), i.e. a 
large part of the domestic area should be observable at a reasonable 
resolution. For hand-camera CA3 the viewing direction is straight through the 
fingers (i.e. approximately orthogonal to that of CA2), and the objective is 
of medium focal length (e.g. 6mm) such that both the fingers and a grasping 
object are contained in the field of view. Third, two cameras of the robot 
head (so-called head-cameras CA4 and CA5) are used for controlling the 
placement of obstacle objects at the parking area or of the target object at 
the inspection area. The head-cameras are also used for detailed object 
inspection. Depending on the category of the carried object (obstacle or 
target object) the DOFs of the robot head must be changed appropriately such 
that the parking area or the inspection area, respectively, will appear in the 
field of view. The pan and tilt DOF of the robot head range from -90° to +90° 
each. The vergence DOF for each camera ranges from -45° to +45°. The focal 
length of the head-cameras can vary between 11mm and 69mm. 

Abstract Visualization of the Original Scene 

Figure 4.9 shows an abstract depiction of the original scene. For the purpose 
of task decomposition we introduce five virtual points P1, P2, P3, P4, P5, 
which will serve as intermediate positions of the trajectories of the robot 
hand (see below). Generally, these positions are represented in the coordinate 
system of the robot arm which is attached at the static base. Position P1 is 
the starting point of the robot hand for solving the task; positions P2 and P3 
are located near the domestic area and serve as starting points for actively 
locating the target object and grasping the objects, respectively; position P4 
is the point from which to start the servoing procedure for placing an 
obstacle object on the parking area; and finally position P5 is the point from 
which to start the servoing procedure for placing the target object on the 
inspection area. 

The decomposition of the high-level, deliberate task into sub-tasks, and 
the configuration and implementation of task-specific modules is based on 
an experimental designing phase. Therefore, in the following subsections the 
designing phase and the application phase are explained together for each 
sub-task. 

4.3.2 Localization of a Target Object in the Image 

In the application phase the first goal is to find a target object among a set 
of objects which are located on the domestic area. 

Designing Aspects for Localization of a Target Object 

For the localization of a target object an operator is needed which should be 
robust and efficient (as has been studied extensively in Chapters 2 and 3). 
Usually, the application phase leaves open certain degrees of freedom in 
arranging cameras and taking images. It is an issue of the designing phase to 
determine viewing conditions which yield optimal robustness and efficiency 
in object localization. A prerequisite for obtaining robustness is to keep 
similar viewing conditions during the learning phase and the application 
phase. In general, it is favourable to take top views of the objects instead 
of side views, because in the latter case we have to deal with occlusions. 
However, there may be constraints in taking images, e.g. limited free space 
above the collection of objects. These aspects of the application phase must 
be considered for arranging similar conditions during the learning phase. A 
prerequisite for obtaining efficiency is to keep the complexity of the 
appearance manifold of a target object as low as possible. This can be reached 
in the application phase by constraining the possible relationships between 
camera and target object and thus reducing the variety of viewing conditions. 

In the designing phase, we demonstrate views of the target object and of 
counter objects (for the purpose of learning from them), but consider 
constraints and ground truths and exploit degrees of freedom (which are 
supposed to be relevant in the application phase). Based on this, the system 
learns operators as robust and efficient as possible (under the supervision of 
the designer). Additionally, the system should come up with desirable 
geometric relationships between camera and object (subject to the degrees of 
freedom in the application phase). For example, as a result of the learning 
process we may conclude that objects should be observed from the top at a 
certain distance to the ground and that the optical axis of the hand-camera 
CA2 should be kept normal to the ground plane. In Figure 4.9 we introduced a 
virtual point P2, which is straight above a certain corner of the domestic 
area, and whose normal distance from this area is the optimal viewing 
distance. 

Additionally, we may conclude that at this optimal viewing distance the field 
of view of the camera is smaller than the domestic area. However, the set of 
objects is spread throughout the whole area and only a subset can be captured 
in a single image. Consequently, in the application phase the robot arm has to 
move the hand-camera CA2 horizontally over the domestic area (at the optimal 
viewing distance, which defines the so-called viewing plane) and take several 
images step by step. For example, a horizontal meander-type movement with the 
starting point P2 would be appropriate. Figure 4.10 shows an intermediate step 
of this movement. On the left, the domestic area and the hand-cameras are 
shown; the optical axis of the camera CA2 is directed (approximately) normal 
to the ground. The image on the right is obtained by the hand-camera CA2 and 
depicts only a part of the domestic area. Because of the specific interplay 
between the characteristics of sub-task, environment, and camera, it is 
necessary to execute camera movements according to a certain strategy. 




Fig. 4.10. (Left) Domestic area and hand-cameras; (Right) Part of the domestic 
area taken by the hand-camera. 



The specific shape of the meander-type movement must be determined such that a 
complete image of the large domestic area can be constructed, i.e. the 
complete image is fitted together from single images. Generally, the stopping 
places of the hand-camera CA2 should be chosen such that the collection of 
single images does not leave any holes in the domestic area which would not be 
captured. On the other hand, one should avoid too much overlap between 
neighboring single images, because this reduces efficiency in object 
localization due to repeated applications in the overlapping image areas. 
Interestingly, our approach of object recognition requires a certain degree of 
overlap which is based on the size of the rectangular object pattern used for 
recognition. In the case that the target object is partly outside the field of 
view, the learned operator for target localization cannot be applied 
successfully to this image. However, if we arrange an overlap between 
neighboring images of at least the expected size of the target pattern, then 
the object is fully contained in the next image and can be localized 
successfully. 

Consequently, for determining the relevant increments for the stopping places 
of the hand-camera CA2 (and taking images) we must take into account the size 
of the appearance pattern of the target object.^12 Furthermore, a kind of 
calibration is needed which determines the number of pixels per millimeter 
(see Subsection 4.4.2 later on). The principle is demonstrated in Figure 4.11. 
On the top left the size of the image is depicted and on the top right the 
size of the rectangular object pattern. On the bottom left and right the 
domestic area is shown (bold outlined), with the meander and stopping places 
of the hand-camera CA2 depicted on the left, and the series of overlapping 
images depicted on the right. The size of the rectangular object pattern 
correlates with the overlap between neighboring single images, and 
consequently the target pattern is fully contained in a certain image 
regardless of its location on the domestic area (e.g. see the two occurrences). 
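
The choice of stopping places can be illustrated with a short Python sketch. 
It assumes a rectangular domestic area with its origin at one corner, a 
rectangular image footprint on the ground obtained from the pixel-per-
millimeter calibration, and an overlap equal to the size of the object 
pattern; the function names axis_stops and meander_stops and all parameter 
names are hypothetical. 

```python
import math


def axis_stops(area_len, image_len, pattern_len):
    """Camera centre positions along one axis (all lengths in millimeters)."""
    if pattern_len >= image_len:
        raise ValueError("object pattern must be smaller than the image")
    if area_len <= image_len:
        return [area_len / 2.0]                # one image covers the whole axis
    step = image_len - pattern_len             # overlap equals the pattern size
    count = math.ceil((area_len - image_len) / step) + 1
    last = area_len - image_len / 2.0          # keep the last image inside the area
    return [min(image_len / 2.0 + k * step, last) for k in range(count)]


def meander_stops(area_mm, image_px, pattern_px, px_per_mm):
    """Stopping places of the camera, ordered as a meander over the area."""
    image_w, image_h = (p / px_per_mm for p in image_px)
    pattern_w, pattern_h = (p / px_per_mm for p in pattern_px)
    xs = axis_stops(area_mm[0], image_w, pattern_w)
    ys = axis_stops(area_mm[1], image_h, pattern_h)
    stops = []
    for row, y in enumerate(ys):
        row_xs = xs if row % 2 == 0 else list(reversed(xs))  # alternate direction
        stops.extend((x, y) for x in row_xs)
    return stops
```

For instance, meander_stops((500, 300), (400, 300), (100, 100), 2.0) describes 
an area of 500mm by 300mm, an image of 400 by 300 pixels, and a pattern of 100 
by 100 pixels at 2 pixels per millimeter. Neighboring images then overlap by 
at least the pattern size in both directions, so a pattern cut off in one 
image is fully contained in a neighboring one. 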

Generally, the locations of all three particular ground planes (domestic 
area, parking area, and inspection area) are determined by a teach-in 
approach. The designer uses the control panel of the robot arm and steers the 
tip of the robot hand in succession to certain points of the particular areas, 
e.g. the four corner points of the rectangular domestic and parking area, 
respectively, and a point at the center and one at the boundary of the 
circular inspection area. The respective positions are determined and 
memorized automatically based on inverse kinematics. Consequently, the 
positions of these areas are represented in the base coordinate system of the 
robot arm. The virtual positions P1, P2, P3, P4, P5 are also represented in 
this coordinate system, and the specific relations to the three areas are 
based on certain experiments. 

Position P1 is specified depending on the relationship between parking 
and inspection area (e.g. middle point between the areas). Position P2 is 
specified based on the location of the domestic area. Specifically, as a 
result of experiments on learning operators for object recognition one 
determines the relation between P2 and the domestic area, as explained above. 
Starting at position P1 the robot hand must move to position P2 and there 
continue with a meander-type movement over the domestic area. The movement of 
the robot hand from position P1 to P2 may also be specified by a teach-in 
approach, i.e. the designer supplies intermediate positions for approximating 
the desired trajectory. The shape and stopping places of the meander-type 
movement over the domestic area are determined based on experiments on 
object localization. 

^12 This strategy exemplifies the degree to which robot effector movements 
and image analysis techniques must work together, i.e. perception and action 
are strongly correlated for solving sub-tasks. 

Fig. 4.11. Meander-type movement of hand-camera CA2 and taking images with 
an overlap which correlates with the size of the rectangular object pattern. 

Task-Specific Modules for Localization of a Target Object 

As a result of the experimental designing phase we define a task-specific
module MT1 which is based on the execution of a generic module of type
MI1, followed by a generic module of type MI2, followed by a generic module
of type MI3, followed again by a generic module of type MI2.

MT1 := ( MI1 & MI2 & MI3 & MI2 ) (4.8)

The module MI1 expects a human operator to steer the robot hand to
certain points on the domestic area in order to obtain relevant positions
in the coordinate system of the robot arm. In particular, the starting
position P2 of the meander-type movement over the domestic area can be
determined from these points. Furthermore, the human operator must teach
the system a certain trajectory from the original position P1 to position
P2. The first occurrence of module MI2 is responsible for moving the robot
hand from position P1 to position P2 along the specified trajectory. The
module MI3 is responsible for the meander-type movement over the domestic
area and taking a series of images. The second occurrence of module MI2
moves the robot hand back to starting position P2. Pre-processing functions
must determine the meander structure on the viewing plane over the whole
domestic area, and define (few) intermediate stops as discussed above.
Post-processing functions will localize the target object in the collection
of single images. The output of the module is the index of the image
containing the target object and the relevant position in that image.

The generic module MI3 (explained in Subsection 4.2.2) has been applied
for taking images during the application phase. In addition, this generic
module can also be applied during the designing phase for two other
purposes. First, we must take images of the target and of counter objects
under different viewing conditions such that the system can learn an
operator for object recognition (see Section 3.2). Second, we must take images
of artificial or natural calibration objects such that the system can
approximate the transformation between image and robot coordinate systems
(see Subsections 4.4.1 and 4.4.2 later on). The responsible task-specific
modules are similar to MT1 and therefore are not presented in this work.

4.3.3 Determining and Reconstructing Obstacle Objects 

In the application phase the second goal is to determine and reconstruct 
possible obstacle objects which prevent the robot hand from approaching the 
target object. 

Designing Aspects for Determining/Reconstructing Obstacles 

In order to design task-specific modules for determining obstacle objects we 
must deal with the following aspects. 

The purpose of reaching the target object is finally to grasp it. The critical
issue of object grasping is to arrange a stable grasping situation under the
constraint of occupying little space in the grasping environment (the latter
simplifies the problem of obstacle avoidance during grasping). The hand
of our robot arm is equipped with parallel jaw fingers, and additional space
is occupied by the two hand-cameras and the fixing gadget (see Figure 4.10
(left)). Given this specific architecture it is favourable to grasp objects
by keeping the fingers horizontal. We assume that the fingers should
translate and/or rotate in a horizontal plane (so-called grasping plane),
which is a virtual copy of the domestic plane with a certain vertical offset.
The specific point P3 belongs to this grasping plane and is used as the
starting position for incrementally reaching a grasping situation at the
target object.

The point P3 should be defined advantageously with regard to the problem
of obstacle avoidance, e.g. it is convenient to construct the point as follows.
We determine a straight line between a center point of the robot arm (e.g.
origin of coordinate system) and the position of the target object, vertically
project this line onto the grasping plane, and take the intersecting boundary
point of the grasping plane, which is nearest to the robot arm, as the point P3
(see top view from robot arm and domestic area in Figure 4.12). The target
object can be grasped only if there is enough free space along the route
of approaching the object. Generally, the space between the robot hand and
the robot center is occupied to a certain extent by the robot arm and its
3D volume is non-rigid. In consequence of this, the system must determine
a route to the target object such that the requested 3D space for the robot
arm is not occupied by obstacle objects (see Figure 4.13).
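The construction of P3 can be sketched as a line clipping problem. The
sketch below assumes that the grasping plane is an axis-aligned rectangle in
the basis coordinate system of the robot arm, that the robot center lies
outside of it and the target object inside; the name construct_p3 and its
parameters are hypothetical.

```python
def construct_p3(center, target, rect, z_grasp):
    """Boundary point of the grasping plane nearest to the robot center.

    center, target: (x, y, z) positions; the vertical projection simply
    drops z. rect: (x_min, x_max, y_min, y_max) of the grasping plane.
    Returns (x, y, z_grasp), or None if the projected line segment from
    center to target does not reach the rectangle.
    """
    cx, cy = center[0], center[1]
    dx, dy = target[0] - cx, target[1] - cy
    x_min, x_max, y_min, y_max = rect
    t_enter, t_exit = 0.0, 1.0
    # Clip the segment center + t * (dx, dy), t in [0, 1], against both slabs.
    for start, delta, lo, hi in ((cx, dx, x_min, x_max), (cy, dy, y_min, y_max)):
        if delta == 0:
            if not lo <= start <= hi:
                return None  # segment runs parallel to the slab, outside of it
            continue
        t0, t1 = (lo - start) / delta, (hi - start) / delta
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
    if t_enter > t_exit:
        return None
    return (cx + t_enter * dx, cy + t_enter * dy, z_grasp)
```

The entry parameter t_enter marks the first boundary point met when moving
from the robot center towards the target, i.e. the one nearest to the robot.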




Fig. 4.12. Constructing point P3 based on center position of robot arm and
position PG of target object.



It may happen that objects are located quite densely and no collision-free
route to the target object is left. In this case, the relevant obstacle objects
must be determined and carried to the parking area. In our application, the
two sub-tasks are sequentialized completely, i.e. first determining all relevant
obstacles and second carrying them away (the latter is treated later on). It
is favourable to determine potential obstacle objects by moving the
hand-camera CA2 over the domestic area with the optical axis directed
vertically. Based on the collection of single images one obtains a global
impression of the arrangement of all objects, which is necessary for planning
the collision-free route automatically. More concretely, we only need to know
those objects which are located between the target object and the robot arm;
the other objects are not relevant, because they are not on the path of
approaching the target. Based on the global impression the system should
determine that route to the target object along which a minimum number of
obstacles is located, i.e. carrying away only a minimum number of obstacles.

Fig. 4.13. (Left) Grasping a target object at position PG and collision with
obstacle object at position PH; (Right) Grasping without obstacle collision.

For planning the collision-free route we take the non-rigid 3D volume of
the robot arm into account, which changes continually during the movement
of the robot hand. In addition to this, we must generate an approximation
for the three-dimensional shape of the set of potentially relevant objects. The
non-rigid volume of the robot arm and the approximated object volumes
must not overlap. Consequently, during the movement of the hand-camera
CA2 over the domestic area we must take images according to a strategy such
that it is possible to reconstruct 3D information of the relevant objects.
The structure-from-motion-stereo paradigm can be applied for 3D
reconstruction, i.e. the hand-camera CA2 moves in small steps and takes
images, 2D positions of certain image features are extracted, correspondences
between the features of consecutive images are determined, and finally 3D
positions are computed from 2D correspondences. Essentially, for simplifying
the correspondence problem we must take images with small displacements, which
is quite different from the strategy for localizing the target object (see
Subsection 4.3.2).

A specific version of the structure-from-motion-stereo paradigm is
implemented by Blase [22]. The 3D shape of an obstacle object is approximated
on the basis of a collection of points which originate from the surface of the
object. In the images these points are determined as gray value corners and
for their detection the SUSAN operator is applied. Correspondences of gray
value corners between consecutive images are obtained by normalized cross
correlation of small patterns which are centered at the extracted features.
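The correspondence step can be illustrated by a small sketch in plain
Python. It is not the implementation of Blase [22]; it assumes the zero-mean
variant of normalized cross correlation, images given as lists of rows,
corners given as (row, column) and lying far enough from the image border,
and a hypothetical acceptance threshold min_score.

```python
import math

def ncc(patch_a, patch_b):
    """Zero-mean normalized cross correlation of two equally sized patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return sum(x * y for x, y in zip(da, db)) / denom

def cut_patch(image, row, col, half):
    """Square pattern of size (2*half+1) centered at (row, col)."""
    return [r[col - half:col + half + 1] for r in image[row - half:row + half + 1]]

def match_corner(img1, corner, img2, candidates, half=1, min_score=0.8):
    """Return the candidate corner of img2 that best matches corner, or None."""
    ref = cut_patch(img1, corner[0], corner[1], half)
    best, best_score = None, min_score
    for cand in candidates:
        score = ncc(ref, cut_patch(img2, cand[0], cand[1], half))
        if score > best_score:
            best, best_score = cand, score
    return best
```

Because the camera moves in small steps, the candidates can be restricted
to corners near the previous position, which keeps the search cheap.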

For example, Figure 4.14 shows corresponding gray value corners between
consecutive images. The function for 3D reconstruction from two corresponding
2D positions is approximated by a mixture of radial basis function networks
which must be trained with the use of a calibration pattern. In the
application phase the camera moves along in front of a collection of objects,
and during this process some of the objects appear in the field of view and
others may disappear (see Figure 4.15). Continually, 3D surface points are
reconstructed from pairs of 2D positions of corresponding gray value corners.
The collection of 3D points must be cleared of outliers and clustered
according to coherence (see left and right image in Figure 4.16). The convex
hull of each cluster is used as an approximation of the 3D volume of the
scene object. The approach is only successful if there is a dense inscription or
texture on the surface of the objects (e.g. inscriptions on bottles). Otherwise,
a more sophisticated approach is necessary, e.g. extracting object boundaries
from images by using techniques presented in Chapter 2.







Fig. 4.14. Extracting gray value corners in two consecutive images; detecting
correspondences.

Fig. 4.15. Four consecutive images and extracted gray value corners (black dots).



3D boundary approximations of objects are the basis for planning
collision-free routes over the domestic area. In a previous sub-task the target
object has been localized in the image and discriminated from all other
objects. Consequently, we can distinguish the 3D boundary of the target object
from the 3D boundaries of the other objects. On the other hand, we need to
know the volume of the robot arm which is different for each position of the
robot hand. This volume can be determined by computing the inverse
kinematics and, based on this, taking the fixed state vector of the robot arm
into account, e.g. length and diameter of links. We decided that the
approaching of the target object must be carried out on a horizontal plane to
which point P3 belongs, i.e. this point is the starting position of the
movement (see Figure 4.9). A possible planning strategy for obtaining a
collision-free route to the target object works with the methodology of
dynamical systems. Concretely, the following six steps are involved.

Fig. 4.16. (Left) Collection of 3D points originating from the bottles in
Figure 4.15; (Right) Clusters of 3D points cleared of outliers.

First, the three-dimensional boundary shapes of all relevant objects are
projected vertically on the grasping plane, which results in silhouettes of the
objects viewed from the top. Second, we specify an attractor on the grasping
plane at the center position of the silhouette of the target object, i.e. we
assume this is the place from where to grasp the target object. Third, we
spread out a set of repellors equidistantly over the silhouette boundary of
each other object. This multitude of repellors for each potential
obstacle object is useful for keeping the robot hand off any part of the
object surface. Fourth, the attractor and repellor vector fields are summed up.
From the resulting field we can extract movement vectors leading towards the
target object, if there is a possible route at all. However, so far we have not
taken into account the volume of the robot arm. Fifth, for this purpose we
apply an exploration approach, which takes place in virtual reality including
virtual movements of the robot hand. Based on the previously determined
movement vectors the robot hand moves virtually along a suggested vector
(beginning from point P3), and for the new position it is tested whether
a virtual collision with an object has occurred. If this is the case, then
a repellor is specified for this hand position, the hand is moved back
virtually, and another hypothetical movement vector can be tried. Sixth,
if there is no collision-free route towards the target object we determine the
object which is located on or near the straight line between position P3 and
the position of the target object and is closest to P3. This object is
considered as the obstacle which must be carried to the parking area.
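The second to fourth steps can be sketched as follows. The concrete shape
of the vector fields is not specified above, so the sketch assumes a unit
attractor vector and repellor vectors whose magnitude decays exponentially
with distance; the names field_vector and follow_field as well as the
parameters strength, reach, step, and tol are hypothetical. The virtual
collision test of the fifth step is omitted.

```python
import math

def field_vector(pos, attractor, repellors, strength=1.0, reach=1.0):
    """Sum of one attractor vector and several repellor vectors at pos."""
    ax, ay = attractor[0] - pos[0], attractor[1] - pos[1]
    dist = math.hypot(ax, ay)
    vx, vy = (ax / dist, ay / dist) if dist > 0 else (0.0, 0.0)
    for rx, ry in repellors:
        dx, dy = pos[0] - rx, pos[1] - ry
        d = math.hypot(dx, dy)
        if d == 0:
            continue
        # Assumed repellor shape: pushes away with magnitude exp(-d / reach).
        weight = strength * math.exp(-d / reach) / d
        vx, vy = vx + weight * dx, vy + weight * dy
    return vx, vy

def follow_field(start, attractor, repellors, step=0.1, tol=0.15, max_steps=1000):
    """Follow the summed field from start; return the route or None."""
    pos, route = start, [start]
    for _ in range(max_steps):
        if math.hypot(attractor[0] - pos[0], attractor[1] - pos[1]) <= tol:
            return route
        vx, vy = field_vector(pos, attractor, repellors)
        norm = math.hypot(vx, vy)
        if norm < 1e-9:
            return None  # equilibrium away from the target: no route found
        pos = (pos[0] + step * vx / norm, pos[1] + step * vy / norm)
        route.append(pos)
    return None
```

In the fifth step a detected virtual collision would simply append the
current hand position to the list of repellors before the route is tried
again.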

The original sub-task of approaching the target object must be deferred for
a moment. The new sub-task of approaching the obstacle object is reflected
in the vector field representation as follows. We take a copy of the original
vector field, erase the attractor at the position of the target object, erase
all repellors at the relevant obstacle object (determined as discussed above),
and specify an attractor at the position of this relevant obstacle object.
Based on the updated superposition of attractor and repellor vector fields the
robot hand can approach the obstacle object for finally grasping it.

Task-Specific Modules for Determining/Reconstructing Obstacles 

As a result of the experimental designing phase we define two task-specific 
modules which must be applied in the application phase sequentially. 

The first task-specific module MT2 is responsible for the meander-type
movement over a part of the domestic area and taking a series of images
(generic module MI3), reconstructing the boundary of target object and
obstacle objects, determining point P3, moving the robot hand to position
P3 and rotating robot fingers parallel to grasping plane (generic module MI2).

MT2 := ( MI3 & MI2 ) (4.9)

The meander-type movement is restricted to the sub-area between robot 
center and target object, and images are taken in small incremental steps. 
The pre-processing function must determine the meander structure on the 
viewing plane over a part of the domestic area, and define many interme- 
diate stops (as discussed above). Post-processing functions are responsible 
for the extraction of object boundaries, detection of correspondences, 3D re- 
construction, projection on grasping plane, and determining point P3 on the 
grasping plane. The output of the module comprises the silhouette contours of 
target object and the other objects, and the starting point P3 on the grasping 
plane. 

The second task-specific module MT3 is responsible for determining the
obstacle object (located nearest to point P3) which prevents the robot hand
from approaching the target object. This module is different from the
previous task-specific modules in that no real but only virtual movements of
the robot hand take place, i.e. it is a task-specific planning module. The
overall vector field must be updated by doing virtual exploratory movements
for avoiding collisions with the robot body. A pre-processing function must
construct an overall vector field from the silhouette contours of target
object and obstacle objects. Post-processing functions determine the relevant
obstacle object and modify the vector field such that the obstacle object can
be approached. The outcome is a vector field for approaching the obstacle
object.

4.3.4 Approaching and Grasping Obstacle Objects

In the application phase the third goal is to approach and grasp an obstacle 
object. 

Designing Aspects for Approaching/Grasping Obstacles 

The obstacle object has been determined such that during the approaching
process no collision will occur with other objects. Starting at position P3 the
obstacle object can be approached by following the force vectors until the
equilibrium (attractor center) is reached. In simple cases, an instructional
module of type MI2 is applicable which executes an assembled instruction
without continual visual feedback. The usefulness of this strategy is based on
the assumption that the position of the obstacle object can be reconstructed
exactly and that the object can be grasped arbitrarily. This assumption does
not hold in every case, and therefore we present a more sophisticated strategy
which combines deliberate plan execution with visual feedback control.

The first part of the strategy is equal to the instructional movement
mentioned just before, i.e. the robot hand will approach the obstacle object
by following the vectors of the deliberate vector field. However, the plan is
executed only as long as the obstacle object is not contained in the field
of view of hand-camera CA3 or is just partly visible. The hand-camera CA3
is supposed to be used for a fine-tuned control of the grasping process, and
for this purpose both the grasping fingers and the grasping object must be
located in the field of view. As soon as the object is visible completely, the
plan is interrupted and a process of visual feedback control continues with
the sub-task of grasping. The robot hand must be carefully servoed to an
optimal grasping situation, i.e. a high accuracy of assembling is desired. For
this purpose, the hand-camera CA3 must be equipped with an objective such
that both robot fingers and at a certain distance the grasping object can be
depicted at a reasonable resolution.

It is assumed that the silhouette of the object is elongated (instead of 
round) and the length of the smaller part is less than the distance between the 
two grasping fingers, i.e. grasping is possible at all. For a detailed explanation 
of the servoing strategy we introduce for the robot hand a virtual hand axis 
and a virtual gripper point. The virtual hand axis is the middle straight line 
between the two elongated fingers. The virtual gripper point is obtained by 
first computing the end straight line, which connects the two finger tips, and 
then intersecting this line with the virtual hand axis. Furthermore, we also 
introduce a virtual object axis of the grasping object which is defined as the 
first principal component axis of the object silhouette. 
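The virtual object axis can be computed in closed form from the second
moments of the silhouette pixels. The following sketch assumes the
silhouette is given as a list of (x, y) pixel coordinates; the function
name principal_axis is hypothetical.

```python
import math

def principal_axis(points):
    """Return (center, angle) of the first principal component axis.

    points: (x, y) coordinates of the silhouette pixels.
    angle: orientation in radians in (-pi/2, pi/2], measured from the x-axis.
    """
    pts = list(points)
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    # Entries of the 2x2 covariance matrix of the pixel coordinates.
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    # Orientation of the eigenvector belonging to the larger eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), angle
```

For a round silhouette the two eigenvalues coincide and the angle becomes
arbitrary, which is why the silhouette is assumed to be elongated.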

In Subsection 3.4.1 we discussed strategies for an efficient recognition
of objects or situations, e.g. reducing the complexity of the manifold
of appearance patterns. For keeping the manifold of grasping situations
tractable we found it favourable to sequentialize the grasping process into
four phases (see Figure 4.17). First, the robot hand must reach a situation
such that the virtual hand axis is running through the center position of
the object silhouette. Second, the robot hand should move perpendicular to
the virtual hand axis until the virtual gripper point is located on the virtual
object axis. Third, the robot hand must rotate around the virtual gripper
point for making the virtual hand axis and the virtual object axis collinear.
Fourth, the robot hand should move along the virtual hand axis until an
optimal grasping situation is reached. All four phases must be executed as
servoing procedures, including continual visual feedback, to take care of
inaccuracies or unexpected events. Figure 4.18 shows the intermediate steps of
the grasping process in a real application. In the following we present
examples for the type of measurements in the images and for the control
functions on which the behavioral modules are based.

Sophisticated work on automated grasp planning has been done by Rohrdanz et
al. [142].

Four Phases of a Robotic Grasping Process 

The aspects of the first phase of the grasping process are included more or
less in the successive phases and therefore are not discussed specifically.

For the second phase we may work with the virtual gripper point and the
virtual object axis explicitly. The virtual gripper point can be extracted by a
combination of gray value and geometric features as follows. By normalized
cross correlation the gripper tip is located roughly (see Figure 4.19). In order
to verify the place of maximum correlation and localize the position of the
virtual gripper point exactly we additionally extract geometric features of the
fingers.

Hough transformation can be used for extracting the elongated straight
lines of the finger boundaries. Under the viewing perspective of the
hand-camera CA3 the two top faces of the fingers are clearly visible and
appear bright, and the silhouette of each top face mainly consists of two
elongated lines. Taking the polar form for representing lines, the
Hough image is defined such that the horizontal axis is for the radial
distance and the vertical axis is for the orientation of a line. Accordingly,
the two pairs of elongated boundary lines of the parallel jaw gripper occur
in the Hough image as four peaks which lie nearly on a horizontal line due to
similar line orientations (see Subsection 2.3.1). According to this specific
pattern of four peaks the elongated finger lines are extracted and from those
also the virtual hand axis. The end straight line, which virtually connects the
two finger tips, can be extracted by using Hough transformation in combination
with the SUSAN corner detector (see Subsection 2.2.3). The virtual gripper
point is determined by intersecting the relevant lines (see Figure 4.20). On
the other hand, we will extract the virtual axis of the grasping object as the
first principal component axis of the object silhouette.
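Once the lines are available in polar form, the virtual gripper point
follows from a 2x2 linear system. The sketch below assumes the line
representation x cos(theta) + y sin(theta) = r and, as a simplification,
derives the virtual hand axis from only two nearly parallel finger lines
instead of all four; the function names are hypothetical.

```python
import math

def intersect_polar_lines(line_a, line_b):
    """Intersect two lines (r, theta) with x*cos(theta) + y*sin(theta) = r."""
    (r1, t1), (r2, t2) = line_a, line_b
    det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
    if abs(det) < 1e-12:
        return None  # parallel lines have no unique intersection
    x = (r1 * math.sin(t2) - r2 * math.sin(t1)) / det
    y = (r2 * math.cos(t1) - r1 * math.cos(t2)) / det
    return x, y

def middle_line(line_a, line_b):
    """Middle line of two nearly parallel lines (similar theta, no wrap-around)."""
    return ((line_a[0] + line_b[0]) / 2, (line_a[1] + line_b[1]) / 2)

def virtual_gripper_point(finger_line_a, finger_line_b, end_line):
    """Intersection of the virtual hand axis with the end straight line."""
    return intersect_polar_lines(middle_line(finger_line_a, finger_line_b), end_line)
```

Averaging r and theta is only adequate because the finger lines have
similar orientations, as stated above for the four Hough peaks.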

Based on all these features, we can define the measurement vector Q(t) for
the servoing procedure as the Euclidean distance between the virtual gripper
point and the virtual object axis normal to the virtual hand axis. The desired
measurement vector Q* is simply the value 0. The control vector for moving
the robot hand is biased, i.e. leaving only one degree of freedom for moving
the robot hand on the grasping plane perpendicular to the virtual hand axis.
In particular, both the measurement vectors and the control vectors are scalar.
A constant increment value s is preferred for easily tracking the movement,
and a reasonable value is obtained in the experimental designing phase. That
is, a Jacobian must be determined which describes the relationship between
displacements of the robot hand and the resulting displacements in the image
of hand-camera CA3 (see Subsection 4.4.2 later on). In this special case the
Jacobian is a trivial matrix containing just the constant value s. Then, the
control function is defined as follows.

Parameter η1 specifies the acceptable deviation from the desired value 0.

Fig. 4.17. Grasping process organized in three phases, i.e. perpendicular,
rotational and collinear phase of movement.

Fig. 4.18. Assembling the gripper to an object.

In the third phase of the grasping process the robot hand should rotate
around the virtual gripper point for reaching collinearity between the virtual
hand axis and the virtual object axis. Just one degree of freedom of the
robot hand must be controlled for rotating in the grasping plane, and the
remaining state variables of the robot hand are constant during this sub-task.
The type of measurements in the image can be equal to the previous sub-task
of perpendicular hand movement, i.e. extracting virtual hand axis and
virtual object axis. Based on this, we compute the angle between both axes
for specification of a scalar measurement vector Q(t), which should finally
reach the desired value 0. The control function is equal to the one presented
in equation (4.10). Instead of extracting geometric features explicitly, we
briefly mention two other types of measurements in the image which
characterize the orientation deviation more implicitly.

Fig. 4.20. Construction of finger boundary lines, virtual hand axis, end straight
line, and virtual finger point.

The log-polar transformation can be applied for tracking the obstacle
object in the LPT image by techniques of cross correlation (see explanations in
Subsection 3.4.2). The camera executes a simple rotational movement around
the virtual gripper point, which gives the impression that the object is
rotating around this point (see in Figure 4.17 the images on top right and
bottom left). If we compute the log-polar transformation continually with the
center of the foveal component defined as the virtual gripper point, then the
LPT pattern is translating along the θ axis of the polar coordinate system
(see, for example, Figures 3.21 and 3.22). Cross correlation is applicable
for localizing the LPT pattern of the grasping object. Due to the
incremental movements the relevant search area can be constrained, which is
useful for reasons of efficiency. The virtual hand axis and virtual object
axis are approximately collinear, if and only if the LPT pattern is located on
the vertical axis of the LPT image defined by θ = 0. Therefore, alternatively
to the previous type, the scalar measurement vector Q(t) can also be defined by
the current value of θ, which is supposed to represent the center of the LPT
pattern of the obstacle object along the horizontal axis of the LPT image.
That is, this value of θ serves as a measurement of the orientation deviation.
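That a rotation around the virtual gripper point becomes a pure translation
along the θ axis can be checked for a single image point. The sketch
assumes the natural logarithm for the radial coordinate and a point
different from the center; the function names are hypothetical.

```python
import math

def log_polar(point, center):
    """Map an image point to (log radius, theta) relative to center."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.log(math.hypot(dx, dy)), math.atan2(dy, dx)

def rotate(point, center, angle):
    """Rotate an image point around center by angle (radians)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return center[0] + c * dx - s * dy, center[1] + s * dx + c * dy
```

The radial coordinate is unchanged by the rotation, so the search by cross
correlation can be restricted to shifts along θ.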

Alternatively to the approaches of extracting virtual axes or the log-polar
transformation, we can determine the orientation deviation based on
histograms of edge orientations. The approach is easy if the background is
homogeneous and the field of view only contains the fingers and the grasping
object. In this case the original image can be transformed into a binary image
representing gray value edges. We preserve the edge orientations and
construct a histogram thereof. The left and right diagram in Figure 4.21 show
these histograms prior to and after the cycles of rotational gripper movement
(for second and third image in Figure 4.18). The position of the first peak in
Figure 4.21 (left), which is close to 65°, specifies the principal orientation
of the object and the second, larger peak (i.e. close to 90°) the gripper
orientation in the image. During the servoing cycle the gripper orientation
changes, but because the camera is fastened to the hand this appears in the
image as a change of the object orientation. Accordingly, the first histogram
peak must move to the right until it merges into the second peak (Figure 4.21,
right). The current measurement vector Q(t) is defined by the position of the
first peak (object orientation) and the desired measurement vector Q* by the
position of the second peak (gripper orientation).
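A minimal sketch of this measurement, assuming the edge orientations are
already available in degrees and that the two dominant peaks are well
separated; the bin width and the function names are hypothetical choices.

```python
def orientation_histogram(orientations_deg, bin_width=5):
    """Histogram of edge orientations in degrees over [0, 180)."""
    bins = [0] * (180 // bin_width)
    for angle in orientations_deg:
        index = int((angle % 180) // bin_width) % len(bins)
        bins[index] += 1
    return bins

def two_main_peaks(bins, bin_width=5):
    """Bin centers (degrees) of the two largest local maxima, in ascending order."""
    n = len(bins)
    # Orientation is circular, so the neighbours wrap around at 0/180 degrees.
    peaks = [i for i in range(n)
             if bins[i] > 0 and bins[i] >= bins[i - 1] and bins[i] > bins[(i + 1) % n]]
    peaks.sort(key=lambda i: bins[i], reverse=True)
    return sorted((i + 0.5) * bin_width for i in peaks[:2])

def orientation_deviation(orientations_deg, bin_width=5):
    """Difference between the two peak positions; 0.0 once the peaks have merged."""
    peaks = two_main_peaks(orientation_histogram(orientations_deg, bin_width), bin_width)
    return peaks[1] - peaks[0] if len(peaks) == 2 else 0.0
```

The deviation shrinks during the rotational servoing cycles and vanishes
when only one peak remains.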




Fig. 4.21. Edge orientation histograms (horizontal axis: edge orientation);
(Left) Prior to finger rotation; (Right) After finger rotation.



In the fourth phase of the grasping process the robot hand should move 
collinear with the virtual hand axis in order to reach an optimal grasping 
situation. For defining grasping situations we can take the virtual gripper 
point and the object center point into account, e.g. computing the euclidean 
distance between both. If this distance falls below a certain threshold, then 
the desired grasping situation is reached, else the gripper translates in small 
increments. 
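The threshold-and-increment rule used in the servoing phases can be
sketched as follows. This is only an illustration under stated assumptions:
the measurement is a signed scalar, a positive increment of the hand
increases the measurement, and the names control_step, servo, s, and eta
are hypothetical stand-ins for the constant increment and the acceptable
deviation.

```python
def control_step(q, q_desired=0.0, s=1.0, eta=0.5):
    """Scalar increment: +/- s towards q_desired, or 0 within the tolerance eta."""
    error = q_desired - q
    if abs(error) <= eta:
        return 0.0
    return s if error > 0 else -s

def servo(measure, move, s=1.0, eta=0.5, max_cycles=100):
    """Repeat measurement and control cycles until the deviation is acceptable.

    measure: callable returning the current scalar measurement Q(t).
    move: callable applying a scalar increment to the robot hand.
    Returns the number of executed movements, or None if not converged.
    """
    for cycle in range(max_cycles):
        increment = control_step(measure(), 0.0, s, eta)
        if increment == 0.0:
            return cycle
        move(increment)
    return None
```

With a constant increment the tolerance must not be smaller than half the
increment, otherwise the hand may oscillate around the desired value.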

An alternative approach for evaluating the grasping stability has been
presented in Subsection 3.3.4 which avoids the use of geometric features.
A GBF network learns to evaluate the stability of grasping situations on
the basis of training examples. Those example situations are represented as
patches of filter responses in which a band pass filter is tuned to respond
specifically to certain relationships between grasping fingers and object. The
filter responses implicitly represent a measurement of the distance of the
gripper from the most stable position. For example, if the gripper moves step
by step to the most stable grasping pose and then moves off, and sample data
are memorized thereof, then the network may learn a parabolic curve with
the maximum at the most stable situation. A precondition for applying the
approach is that gripper and object must be in a small neighborhood so that
the filter can catch the relation.

Instead of computing for the vector of filter responses a value of grasping 
stability it is possible to associate an appropriate increment vector for moving 
the gripper. In this case, the control function is implemented as a neural 
network which is supposed to be applied to a filter response vector. We do 
not treat this strategy in more detail. 

Task-Specific Modules for Grasping Obstacle Objects

As a result of the experimental designing phase we define a task-specific
module MT4 which is based on the execution of a generic module of type
MB3, followed by the sequential execution of four generic modules each of
type MB1, followed by an elementary instruction of type MI2.

MT4 := ( MB3 & MB1 & MB1 & MB1 & MB1 & MI2 ) (4.11)

The module of type MB3 is responsible for approaching the robot hand to
the obstacle object by following the vectors of the deliberate vector field. The
plan is interrupted as soon as the obstacle object is completely visible in the
field of view of the hand-camera CA3. The four generic modules of type MB1
implement elementary behaviors, i.e. the first one is responsible for arranging
the robot hand such that the virtual hand axis is running through the center
position of the object silhouette, the second one is responsible for the
perpendicular translation of the robot hand relative to the virtual hand axis,
the third one is responsible for the relevant rotation, and the fourth one for
collinear translation for reaching the optimal grasping situation (see above).
The instructional module of type MI2 implements the closing of the fingers
for grasping.



4.3.5 Clearing Away Obstacle Objects on a Parking Area 

In the application phase the fourth goal is to clear away obstacle objects on 
a parking area. 

Designing Aspects for Clearing Away Obstacle Objects 

The robot hand should lift up the obstacle object a pre-specified distance and
move it backward step by step to a virtual point above starting position P3,
i.e. position P3 modified in the Z-coordinate by the pre-specified distance.
Then, the object must be carried from the domestic area to the parking area
along a pre-specified trajectory. The goal position of this trajectory is a virtual
point P4 located a certain distance above the parking area. Finally, the object
must be carefully servoed to a certain place on the parking area in order to
put down the object. For solving these three sub-tasks we must take other
cameras into account for grounding processes of visual feedback control. The
hand-cameras CA2 and CA3 are no longer useful because the fields of view
are occupied to a large extent by the grasped object, and furthermore this
object occludes the space behind it.

The head-cameras CA4 and CA5 are intended for the visual feedback
control of approaching the object to a free place on the parking area. We
assume that a relevant free place has already been found (i.e. we do not treat
this sub-task), and that the virtual point P4 has been determined as an
appropriate starting position for approaching this place. The head-cameras
must be aligned appropriately using the pan, tilt, and vergence degrees of
freedom, e.g. the optical axes should be directed to the center of the parking
area. The zooming/focusing degree of freedom should be tuned such that
the area and the position from where to start the approaching process are
completely contained in the field of view, at the maximal possible resolution,
and under the optimal focus. In a later sub-task the head-cameras CA4 and
CA5 will also be used to control the process of approaching an object to
a place on the inspection area. We assume that the double usage of the
head-cameras is possible without changing the position or orientation of the
robot vehicle (containing the robot head). Instead, for solving the inspection
task (later on) only the degrees of freedom of robot head and cameras are
supposed to be changed appropriately, i.e. keeping similar viewing constraints
for the inspection area which have been mentioned for the parking area.
The domestic area and inspection area are fixed and therefore the robot
vehicle can be steered into the relevant position prior to the application phase.
Relative to this fixed vehicle position the two relevant state vectors of the
head-camera system can be determined for optimally observing the parking area
and the inspection area, respectively.

The head-cameras must take images continually for the visual feedback 
control of putting down an object on a goal place of the parking area. In each 
of the stereo images both the object and the goal place are visible, which is a 
precondition for determining a certain kind of distance. Based on the distance 
measurements in the two images, a control vector is computed for carrying 
the object nearer to the goal place. This principle can also be applied for 
treating the peg-in-hole problem, e.g. in a system implemented by Schmidt 
we used cylinders and cuboids [150]. The critical issue is to extract the 
relevant features from the stereo images. 

Footnote: Task-specific modules are needed for supporting the system designer 
in the sub-task of vehicle and camera alignment. However, this work does not 
address that issue. 

For example, let us assume a cylindrical object and a circular goal place 
as shown in the top left image of Figure 4.22. The binary image which has 
been computed by thresholding the gradient magnitudes is depicted at the top 
right. In the next step, a specific type of Hough transformation is applied for 
approximating and extracting half ellipses. This specific shape is supposed 
to occur at the goal place and at the top and bottom faces of the object. 
Instead of full ellipses we prefer half ellipses, concretely the lower part of 
full ellipses, because due to the specific camera arrangement that feature is 
visible throughout the complete process. From the bottom face of the object 
only the specific half ellipse is visible. The process of approaching the object 
to the goal place is organized such that the lower part of the goal ellipse 
remains visible, but the upper part may become occluded more and more by 
the object. The extraction of the lower half of ellipses is shown in the bottom 
left image in Figure 4.22. The distance measurement between object and goal 
place just takes the half ellipse of the goal place and that from the bottom 
face of the object into account. For computing a kind of distance between the 
two relevant half ellipses we extract from each a specific point and based on 
this we can take any metric between 2D positions as distance measurement. 
The bottom right image in Figure 4.22 shows these two points, indicated by 
crosses, on the object and the goal place. 

The critical aspect of extracting points from a stereo pair of images is 
that reasonable correspondences must exist. A point of the first image is in 
correspondence with a point of the second image, if both originate from the 
same 3D point. In our application, the half ellipses extracted from the stereo 
images are the basis for determining corresponding points. However, this is 
by no means a trivial task, because the middle point of the contour of the 
half ellipse is not appropriate. The left picture of Figure 4.23 can be used 
for explanation. A virtual scene consists of a circle which is contained in 
a square. Each of the two cameras produces a specific image, in which an 
ellipse is contained in a quadrangle. The two dotted curves near the circle 
indicate that different parts of the circle are depicted as lower part of the 
ellipse in each image. In consequence of this, the middle points p1 and p2 
on the lower part of the two ellipses originate from different points P1 and 
P2 in the scene, i.e. points p1 and p2 do not correspond. By contrast, the right 
picture of Figure 4.23 illustrates an approach for determining corresponding 
points. We make use of a specific geometric relation which is invariant under 
geometric projection. 

We translate, virtually, the bottom line of the square to the circle which 
results in the tangent point P. This procedure is repeated in the two images, 
i.e. translating the bottom line of the quadrangle parallel towards the ellipse 
to reach the tangent points p1 and p2. Due to different perspectives the two 
bottom lines have different orientations and therefore the resulting tangent 
points are different from those extracted previously (compare left and right 
picture of Figure 4.23). It is easily observed that the new tangent points 
p1 and p2 correspond, i.e. originate from the single scene point P. In our 
application, we can make use of this compatibility. For this purpose one must 
be careful in the experimental designing phase to specify an appropriate state 
vector for the head-camera system. 

Footnote: In Subsection 1.4.1 we discussed compatibilities of regularities under 
geometric projection. They have proven advantageous in Chapter 2 for 
extracting object boundaries. In this subsection we present another example of 
a compatibility of regularities under geometric projection. It will be useful for 
extracting relevant image features from which to determine correspondences. 

Fig. 4.22. (Top left) Cylinder object and a circular goal place; (Top right) Binary 
image of thresholded gradient magnitudes; (Bottom left) Extracted half ellipses; 
(Bottom right) Specific point on half ellipses of object and goal place. 

Fig. 4.23. (Left) Extracted image points p1 and p2 originate from different scene 
points P1 and P2; (Right) Extracted image points are corresponding, i.e. originate 
from one scene point P. 

In particular, the parking area (including the boundary lines) must be 
completely contained in the field of view of both head-cameras. In the application 
phase a certain part of the boundary lines of the parking area is extracted 
from the two images (the stereo correspondence between those specific image 
lines can be verified easily). For each image the orientation of the respective 
boundary line can be used for determining relevant tangent points at the 
relevant ellipse, i.e. virtually move the lines to the ellipses and keep orientation. 
Tangent points must be extracted at the half ellipse of the goal place and at 
the half ellipse of the bottom face of the object. These points have already 
been shown in the bottom right image of Figure 4.22. 
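
The virtual translation of a boundary line towards an ellipse can be sketched 
in a few lines of Python. The sketch assumes that the half ellipse is available 
as a list of contour points and that the boundary line does not intersect it; 
under these assumptions the tangent point is the contour point with the smallest 
perpendicular distance to the line. The function name `tangent_point` and its 
parameters are illustrative and not taken from the system described here. 

```python
import math

def tangent_point(contour, line_point, line_angle):
    """Return the contour point touched first when the line is shifted parallel.

    contour    -- list of (x, y) points on the half ellipse
    line_point -- any (x, y) point on the boundary line
    line_angle -- orientation of the boundary line in radians
    """
    # Unit normal of the line; the signed distance of p is n . (p - line_point).
    nx, ny = -math.sin(line_angle), math.cos(line_angle)
    px, py = line_point
    return min(contour, key=lambda p: abs(nx * (p[0] - px) + ny * (p[1] - py)))

# Hypothetical data: lower half of an ellipse with half axes 4 and 2.
lower_half = [(4.0 * math.cos(-math.pi * k / 180), 2.0 * math.sin(-math.pi * k / 180))
              for k in range(181)]
level = tangent_point(lower_half, (0.0, -5.0), 0.0)   # horizontal boundary line
tilted = tangent_point(lower_half, (0.0, -5.0), 0.3)  # differently oriented line
```

Applying the function once per stereo image, each time with the orientation of 
the boundary line extracted in that image, yields the pair of points that is 
taken as corresponding. 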

For defining the control vector we need to describe the relationship 
between displacements of the robot hand and the resulting displacements in the 
two stereo images taken by the head-cameras. For this purpose we introduce 
two Jacobians $J_1^f(P)$ and $J_2^f(P)$ which depend on the current position P 
of the virtual gripper point. If we multiply the Jacobian $J_1^f(P)$ (respectively 
Jacobian $J_2^f(P)$) with a displacement vector of the hand position, then the 
product yields the displacement vector in the left image (respectively in 
the right image). The two Jacobians are simply joined together which results 
in a (4 × 3) matrix depending on P. 

$$ J^f(P) := \begin{pmatrix} J_1^f(P) \\ J_2^f(P) \end{pmatrix} \qquad (4.12) $$

In order to transform a desired change from stereo image coordinates into 
manipulator coordinates the pseudo inverse $(J^f)^\dagger(P)$ is computed. 

$$ (J^f)^\dagger(P) := \left( (J^f)^T(P) \cdot J^f(P) \right)^{-1} \cdot (J^f)^T(P) \qquad (4.13) $$

The current position P(t) of the virtual gripper point defines the variable 
state vector $S^v(t)$. The desired measurement vector $Q^*$ is a 4D vector 
comprising the 2D positions of a certain point of the goal place in the stereo 
images. The current measurement vector Q(t) represents the stereo 2D 
positions of a relevant point on the object (see above). 

With these definitions we can apply the following control function. 

$$ C(t) := \begin{cases} s \cdot (J^f)^\dagger(S^v(t)) \cdot (Q^* - Q(t)) & : \; \| Q^* - Q(t) \| > \eta_1 \\ 0 & : \; \text{else} \end{cases} \qquad (4.15) $$

with the servoing factor s to control the size of the steps of approaching 
the goal place. The hand position is changed by a non-null vector C(t) if 
desired and current positions in the image deviate by more than a threshold 
$\eta_1$. Actually, equation (4.15) defines a proportional control law 
(P-controller), meaning that the change is proportional to the deviation 
between the desired and the current position. 
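
As a hedged illustration, one step of the proportional control law of equation 
(4.15) can be sketched in Python. The sketch assumes that the stacked (4 × 3) 
Jacobian is already available as a nested list and has full column rank; the 
pseudo inverse is computed via the normal equations as in equation (4.13). All 
function names, default parameters, and the example numbers are illustrative 
assumptions and not part of the system described here. 

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Jacobian is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pseudo_inverse(j):
    """Left pseudo inverse (J^T J)^-1 J^T of a matrix with full column rank."""
    jt = transpose(j)
    return matmul(invert(matmul(jt, j)), jt)

def p_control_step(jacobian, q_goal, q_now, s=0.5, eta=1.0):
    """Return the hand displacement C(t) for one control cycle."""
    error = [g - c for g, c in zip(q_goal, q_now)]
    if math.sqrt(sum(e * e for e in error)) <= eta:
        return [0.0] * len(jacobian[0])  # goal reached within the threshold
    j_pinv = pseudo_inverse(jacobian)
    return [s * sum(a * e for a, e in zip(row, error)) for row in j_pinv]
```

With s = 1 and an exact, locally constant Jacobian the step would reach the 
goal at once; a smaller servoing factor trades speed for robustness against 
errors in the approximated Jacobian. 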



Task-Specific Modules for Clearing Away Obstacle Objects 

As a result of the experimental designing phase we define a task-specific 
module MT5 which is based on the simultaneous execution of two generic 
modules of type MI2, followed by a generic module of type MB1, followed 
once again by a simultaneous execution of two generic modules of type MI2. 

MT5 := ( ( MI2 | MI2 ) & MB1 & ( MI2 | MI2 ) )    (4.16) 

The first module of type MI2 is responsible for lifting up the obstacle 
object a pre-specified distance, moving it backward step by step to a virtual 
point above starting position P3, and from there carrying the object to a virtual 
point P4 located a certain distance above the parking area. Simultaneously, 
another module of type MI2 can be executed for changing the degrees of 
freedom of the head system such that the parking area is located in the field 
of view. Next, the module of type MB1 can be started, which is responsible 
for the elementary behavior of putting down the object at the goal place 
by visual feedback control. Finally, a module of type MI2 is responsible for 
opening the robot fingers, lifting up the hand by a certain extent, and moving 
the robot hand backward to position P2 of the domestic area. Simultaneously, 
another module of type MI2 can be executed for changing the degrees of 
freedom of the head system such that the original area is located in the field 
of view. 
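
The sequential (&) and simultaneous (|) configuration of modules can be 
mimicked with two small combinators. This is only a sketch under the assumption 
that a module is a callable which performs its work and then returns; the names 
`seq`, `par`, and `module`, and the use of threads for simultaneous execution, 
are illustrative choices and not the implementation of the system described here. 

```python
from concurrent.futures import ThreadPoolExecutor

def seq(*modules):
    """Sequential configuration, written '&' in the text."""
    def run(log):
        for m in modules:
            m(log)
    return run

def par(*modules):
    """Simultaneous configuration, written '|' in the text."""
    def run(log):
        with ThreadPoolExecutor(max_workers=len(modules)) as pool:
            for future in [pool.submit(m, log) for m in modules]:
                future.result()  # wait for completion and re-raise exceptions
    return run

def module(name):
    """Hypothetical stand-in for a generic module; it only records its name."""
    def run(log):
        log.append(name)
    return run

# MT5 := ( ( MI2 | MI2 ) & MB1 & ( MI2 | MI2 ) ), compare equation (4.16)
mt5 = seq(par(module("MI2 arm"), module("MI2 head")),
          module("MB1"),
          par(module("MI2 arm"), module("MI2 head")))

log = []
mt5(log)
```

Because `par` waits for both of its modules before returning, the behavioral 
module MB1 only starts after the arm and the head have reached their 
configurations. 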

Footnote: Alternatively, the P-controller can be combined with an integral and 
a derivative control law to construct a PID-controller. However, the P-controller 
is good enough for this simple control task. 

4.3.6 Inspection and/or Manipulation of a Target Object 

The fifth goal in our series of sub-tasks is the inspection and/or manipulation 
of a target object. For obtaining a collision-free route (on the domestic area) 
towards the target object one must repeat the relevant task-specific modules 
several times. These are the task-specific module MT3 for determining an 
obstacle object, the module MT4 for approaching and grasping an obstacle object, 
and the module MT5 for clearing away the obstacle object on the parking area. 

MT6 := ( MT3 & MT4 & MT5 )*    (4.17) 

The * symbol denotes a repetition of the succession of the three sub-tasks. 
In the case that no further object prevents the robot hand from approaching 
the target object, one can apply module MT4 for approaching and grasping 
the target object, and finally apply a module which is similar to MT5 for 
carrying the target object to the inspection area. We do not treat this anew, 
because only minor changes are required compared to the previous 
specifications, e.g. carrying the target object to the virtual position P5 and then 
putting it down at the center of the inspection area by visual feedback 
control. Accordingly, we assume that the target object is already located 
on the inspection area. 

There is a wide spectrum of strategies for inspecting objects, but we con- 
centrate only on two approaches. First, we discuss criteria for evaluating view- 
ing conditions in order to obtain one optimal image from an object. Second, 
we present an approach for continual handling of an effector, which may carry 
a camera, for the visual inspection of large objects. 

Designing Aspects for Reaching Optimal Viewing Conditions 

For object inspection more detailed information about the target object must 
be acquired. We would like to take an image such that the object appears 
with a reasonable size or resolution, at a certain level of sharpness, and under 
a specific orientation. It is assumed that the target object is located at the 
center of the inspection area and the optical axes of the two head-cameras 
are aligned to this point. However, for the specific sub-task only one head- 
camera is used. For changing the size, resolution, or sharpness of the object 
appearance we must fine-tune the focal length or the lens position of the 
head-camera. In addition to this, appropriate object orientation is reached 
by controlling the angle of the rotary table. Figure 4.24 shows an object 
taken under large and small focal length (left and middle image), and under 
degenerate orientation (right image). 

Fig. 4.24. Transceiver box, taken under large and small focal length, and under 
degenerate orientation. 

The change of the depicted object size can be evaluated by image 
subtraction, active contour construction, optical flow computation, etc. For 
example, an active contour approach [176] is simply started by putting an initial 
contour at the image center and then expanding it step by step until the 
background image area of the object is reached, which is assumed to be 
homogeneous. Based on this representation it is easily evaluated whether the object 
silhouette is of a desired size or locally touches the image border and thus 
meets an optimality criterion concerning depicted object size. In addition 
to this, the sharpness of the depicted object should surpass a pre-specified 
level. A measure of sharpness is obtained by computing the magnitudes of 
gray value gradients and taking the mean of a small percentage of maximum 
responses. High values of gradient magnitudes originate from the boundary 
or inscription of an object, and low values originate from homogeneous areas. 
However, sharpness can best be measured at the boundary or inscription. 
Therefore, the critical parameter, which determines the percentage of maximum 
responses, describes the assumed proportion of boundary or inscription of the 
object appearance relative to the whole object area. In the designing phase, 
this parameter must be specified, which typically is no more than 10 percent. 
The measurements of sharpness should be taken within the silhouette of the 
object including the boundary. 
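
The sharpness measure described above can be sketched as follows. The sketch 
assumes that the image and the object silhouette are given as nested lists and 
that gradients are approximated by central differences; the function name 
`sharpness`, the parameter `top_fraction`, and the gradient operator are 
illustrative assumptions, since the text does not prescribe a specific operator. 

```python
import math

def sharpness(gray, silhouette, top_fraction=0.1):
    """Mean of the strongest gradient magnitudes inside the object silhouette.

    gray         -- 2D list of gray values
    silhouette   -- 2D list of booleans marking object pixels (incl. boundary)
    top_fraction -- assumed proportion of boundary or inscription pixels
    """
    magnitudes = []
    rows, cols = len(gray), len(gray[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            if not silhouette[y][x]:
                continue
            gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0  # central differences
            gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0
            magnitudes.append(math.hypot(gx, gy))
    if not magnitudes:
        return 0.0
    magnitudes.sort(reverse=True)
    count = max(1, int(round(top_fraction * len(magnitudes))))
    return sum(magnitudes[:count]) / count
```

A lens position can then be judged by comparing the returned value with the 
pre-specified sharpness level, or by comparing the values obtained for 
neighboring lens positions. 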

The change of object resolution in the image can be evaluated by 
frequency analysis, Hough transformation, steerable filters, etc. For example, by 
using Hough transformation we extract boundary lines and evaluate distances 
between approximately parallel lines. A measure of resolution is based on the 
pattern of peaks within a horizontal stripe in the Hough image. Figure 4.25 
shows the respective Hough images for the images in Figure 4.24. For the 
case of low (high) resolution the horizontal distances between the peaks are 
small (large). Having the object depicted at the image center the straight 
boundary lines of a polyhedral object can be approximated as straight image 
lines due to minimal perspective distortions. Possibly, in the previous phase of 
localizing the target object on the domestic area, the reliability of recognition 
was doubtful (see Subsection 4.3.2). Now, having this object located on the 
inspection area, one can identify it more reliably by taking images under a 
general object orientation. For example, three visible faces of the transceiver 
box in Figure 4.24 (left and middle) are more useful than the degenerate 
object view in Figure 4.24 (right) which shows only two faces. Taking the 
peak pattern of the Hough transformation into account we can differentiate 
between general and degenerate views (see Figure 4.25, middle and right). 
According to this, the object can be rotated appropriately while preserving 
its position on the rotary table. 

Footnote: Experiments on measurements of sharpness are presented in 
Subsection 4.4.4 (later on). 



Fig. 4.25. Hough transformation of binarized images in Figure 4.24. 



Task-Specific Modules for Reaching Optimal Viewing Conditions 

We discussed only superficially the designing-related aspects for reaching 
optimal viewing conditions, and in a similar way we will discuss the relevant 
task-specific modules. The generic module of type MB1 could be useful for 
changing the focal length incrementally with the goal of reaching a desired 
size for the object appearance in the image. Possibly, the control of the focal 
length must be accompanied by a control of the lens position for fine-tuning 
the sharpness. In this case a generic module of type MB2 would be 
appropriate for treating two goals in combination. However, in general it is difficult to 
specify goals explicitly which may represent optimal viewing conditions, e.g. 
it is difficult to describe the optimal object orientation. Accordingly, 
basic behavioral modules are useful which do not rely on explicit goals (e.g. 
desired measurements) and instead implement exploratory strategies, e.g. a 
generic module of type MB3. Depending on specific applications different 
exploratory behaviors are required; however, we do not treat this in more 
detail. 

Designing Aspects for Continual Handling of an Effector 

The effector may carry a camera for the visual inspection of large objects. Al- 
ternatively, the effector may also be configured as a tool which could be used 
for incremental object manipulation. In the following we treat both alterna- 
tive applications in common. Generally, it is required to move the effector, 
e.g. the robot hand in our robot system, along a certain trajectory and fur- 
thermore keep a certain orientation relative to the object. For example, we 
assume that a gripper finger must be servoed at a certain distance over an 
object surface and must be kept normal to the surface. 

Footnote: The sub-task of reaching optimal viewing conditions requires a more 
principled treatment which is beyond the scope of this work. 

In the following, we take the application scenario of dismantling computer 
monitors. A plausible strategy is to detach the front part of a monitor case 
using a laser beam cutter. The trajectory of the cutter is approximately a 
rectangle (not exactly a rectangle) which surrounds the front part at a 
constant distance, and during this course the beam orientation should be kept 
orthogonal to the relevant surface part. Figure 4.26 shows stereo images of 
a monitor (focal length 12mm) and in more detail the finger-monitor 
relation (focal length 69mm). For this application the control problem is rather 
complicated. Actually, the goal situation is an ordered sequence of 
intermediate goal situations which must be reached step by step. Along the course 
of moving between intermediate goal situations one must keep or reach a 
further type of goal situation. This means that the measurement vector describing 
a situation must be partitioned into two subvectors, the first one consisting 
of attributes which should be kept constant and the second one consisting of 
attributes which must change systematically. 




Fig. 4.26. Stereo images of a monitor, and detailed finger-monitor relation. 



For specifying criteria under which the goal situations are reached it is 
advantageous to visually demonstrate these situations in the experimental de- 
signing phase. The control cycles for approaching and assembling an effector 
to a target object are running as long as the deviation between current situa- 
tion and goal situation is larger than a certain threshold. However, the value 
for this parameter must be specified in terms of pixels which is inconvenient 
for system users. Unfortunately, in complicated applications even a vector of 
threshold values must be specified. To simplify this kind of user interaction it 
makes sense to manually arrange certain goal situations prior to the servoing 
cycles and take images. 

These images are analyzed with the purpose of automatically extracting 
the goal situations and furthermore determining relevant thresholds which 
describe acceptable deviations. For example, for servoing the finger we must 
specify in terms of pixels the permissible tolerance for the orthogonality to 
the surface and for the distance from the surface. Actually, these tolerances 
are known a priori in Euclidean 3D space but must be determined in 
the images. Figure 4.27 shows, by way of example, the tolerance concerning 
orthogonality and distance in the first and second image, and non-acceptable 
deviations in the third and fourth image. For determining the acceptable 
variances in both parameters a simple image subtraction and a detailed analysis 
of the subtraction area is useful. 




Fig. 4.27. Acceptable and non-acceptable finger-monitor relations. 



Keep Moving Behavior and Keep Relation Behavior 



The specific sub-task, i.e. moving an effector around the monitor front and 
keeping an orthogonal relation and a certain distance, can be solved by 
combining two assembled behaviors. The so-called keep-moving behavior is 
responsible for moving the effector through a set of intermediate positions which 
approximate the monitor shape coarsely. The so-called keep-relation behavior 
is responsible for keeping the effector in the desired relation to the current 
part of the surface. The keep-moving behavior strives for moving along an 
exact rectangle but is modified slightly by the keep-relation behavior. For 
the keep-moving behavior four intermediate subgoals are defined which are 
the four corners of the monitor front. 

The head-cameras are used for taking stereo images, each of which 
contains the whole monitor front and the gripper finger. In both images we 
extract the four (virtual) corner points of the monitor by applying approaches 
of boundary extraction as presented in Chapter 2. By combining the 
corresponding 2D coordinates between the stereo images we obtain four 4D vectors 
which represent the intermediate goal positions in the stereo images, i.e. we 
must pass successively four desired measurement vectors 
$Q_1^*, Q_2^*, Q_3^*, Q_4^*$. The variable state vector $S^v(t)$ is defined as 
the 3D coordinate vector P(t) of the finger tip, and the current measurement 
vector Q(t) represents its position in the stereo images. The pseudo inverse 
$(J^f)^\dagger(S^v(t))$ of the Jacobian is taken from equation (4.13). The 
control function for approaching the desired measurement vectors $Q_i^*$, 
$i \in \{1, 2, 3, 4\}$, is as follows. 



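$$ C(t) := \begin{cases} \dfrac{(J^f)^\dagger(S^v(t)) \cdot (Q_i^* - Q(t))}{\| (J^f)^\dagger(S^v(t)) \cdot (Q_i^* - Q(t)) \|} & : \; \| Q_i^* - Q(t) \| > \eta_1 \\ 0 & : \; \text{else} \end{cases} \qquad (4.18) $$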



In the application phase parameter i is running from 1 to 4, i.e. as soon as 
$Q_i^*$ is passed, taking threshold $\eta_1$ into account, the behavior strives 
for $Q_{i+1}^*$. Due to the normalization involved in the control function an 
increment vector of constant length is computed. This makes sense, because in 
the inspection sub-task a camera movement with constant velocity is favourable. 
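
One control cycle of such a keep-moving behavior can be sketched in Python. 
The sketch assumes that the pseudo inverse of the stereo Jacobian is supplied 
as a (3 × 4) nested list and treats it as constant for the cycle; the function 
name `keep_moving_step`, the bookkeeping of the goal index, and the default 
threshold are illustrative assumptions in the spirit of equation (4.18), not 
the original implementation. 

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def keep_moving_step(j_pinv, goals, index, q_now, eta=2.0):
    """Return (increment, index) for one cycle of the keep-moving behavior.

    j_pinv -- pseudo inverse of the stereo Jacobian as a 3x4 nested list
    goals  -- list of 4D desired measurement vectors (the corner points)
    index  -- index of the goal currently strived for
    q_now  -- current 4D measurement vector of the finger tip
    """
    # Switch to the next corner once the current one is reached within eta.
    while index < len(goals) and \
            norm([g - q for g, q in zip(goals[index], q_now)]) <= eta:
        index += 1
    if index == len(goals):
        return [0.0] * len(j_pinv), index  # all corners have been passed
    error = [g - q for g, q in zip(goals[index], q_now)]
    raw = [sum(a * e for a, e in zip(row, error)) for row in j_pinv]
    length = norm(raw)
    if length == 0.0:
        return [0.0] * len(j_pinv), index
    return [r / length for r in raw], index  # increment of unit length
```

The returned unit vector only fixes the direction of motion; the actual step 
size per cycle, and thus the velocity, would be chosen by the caller. 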
The keep-relation behavior is responsible for keeping the finger in an 
orthogonal orientation near to the current part of the monitor front. For taking 
images from the situations at a high resolution (see Figure 4.27) the hand- 
camera CA^ is used. Similar to the grasping sub-task (treated in Subsection 
4.3.4) a rotational and/or a translational movement takes place if the cur- 
rent situation is non-acceptable. For rotational servoing simply histograms 
of edge orientations can be used to distinguish between acceptable and non- 
acceptable angles between finger and object surface. Coming back to the role 
of visual demonstration it is necessary to acquire three classes of histograms 
prior to the servoing cycles. One class consists of acceptable relations, and 
two other classes represent non-acceptable relations, distinguished by 
clockwise or counter-clockwise deviation from orthogonality. Based on this, 
a certain angle between finger and object surface is classified during the ser- 
voing cycles using its edge orientation histogram. 
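
A histogram of edge orientations, as used here for the measurement vector, can 
be computed as in the following sketch. It assumes a gray value image given 
as a nested list, central differences for the gradients, and a weighting of 
each pixel by its gradient magnitude; the function name, the number of bins, 
and the magnitude threshold are illustrative assumptions, since the text does 
not fix these details. 

```python
import math

def edge_orientation_histogram(gray, bins=8, min_magnitude=1.0):
    """Normalized histogram of gradient orientations in the range [0, pi)."""
    hist = [0.0] * bins
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0
            gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude < min_magnitude:
                continue  # skip homogeneous areas
            angle = math.atan2(gy, gx) % math.pi  # orientation, not direction
            hist[min(int(angle / math.pi * bins), bins - 1)] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

Histograms of the demonstrated acceptable and non-acceptable relations would 
serve as training samples, and the histogram of the current image as the input 
to be classified. 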

For example, a GBF neural network can be used in which a collection 
of hidden nodes represents the three manifolds of histograms and an out- 
put node computes an evidence value indicating the relevant class, e.g. value 
near to 0 for acceptable relations and values near to 1 or -1 for non-acceptable 
clockwise or counter-clockwise deviation. As usual, the hidden nodes are cre- 
ated on the basis of the k-means clustering algorithm and the link weights to 
the output node are determined by the pseudo inverse technique. The control 
function for the rotation task is similar to equation (4.10) with the 
distinction that a measure of distance between current and desired measurement 
vectors (i.e. edge orientation histograms) is computed by the GBF network. 
For translating the finger to reach and then keep a certain distance to the 
monitor a strategy similar to the grasping approaches can be used (see Sub- 
section 4.3.4). 

Task-Specific Module for Continual Handling of an Effector 

The cooperation between the keep-moving behavior and the keep-relation 
behavior follows the principle of alternation. The keep-moving 
behavior should approach step by step the four corners of the monitor, i.e. in 
each iteration of its control cycle a small increment towards the next 
monitor corner must be determined. Additionally, in each iteration the second 
control cycle of the keep-relation behavior must bring the effector into the 
desired relation to the current part of the monitor front. As soon as this 
relation is reached, the next iteration of the keep-moving control cycle comes 
into play, and so on. A generic module of type MB2 would be appropriate 
for treating two goals in combination. However, concerning the keep-moving 
behavior we have defined four major subgoals, i.e. passing the four corners of 
the monitor front. In consequence of this, the specific sub-task of inspecting 
or manipulating a monitor front is solved by a sequential combination of the 
relevant generic module MB2, and this defines the task-specific module MT7. 

MT7 := ( MB2 & MB2 & MB2 & MB2 )    (4.19) 

Footnote: The strength of applying the learning process to the raw histogram 
data is that the network can generalize from a large amount of data. However, 
if data compression were done prior to learning (e.g. computing symbolic values 
from the histograms), then quantization or generalization errors would be 
unavoidable. 

4.3.7 Monitoring the Task-Solving Process 

For the exemplary high-level task we have worked out the following sub-tasks 
which must be executed in sequential order: localizing a target object on 
the domestic area (MT1), determining and reconstructing obstacle objects 
(MT2), approaching, grasping, and clearing away obstacle objects on the 
parking area (MT6), approaching, grasping, and carrying the target object to 
the inspection area (MT4, MT5), inspecting and/or manipulating the target 
object (MT7). 

Based on the definitions of task-specific modules in the previous 
subsections, we can introduce a task-specific module MT8 which is used simply as 
a brief notation. 

MT8 := ( MT1 & MT2 & MT6 & MT4 & MT5 & MT7 )    (4.20) 

Designing Aspects for Monitoring the Task-Solving Process 

The overall process, implemented by module MT8, must be supervised with 
generic monitors of the types MM1, MM2, and MM3. The time monitor 
MM1 must check whether the sub-tasks are solved within the periods of time 
which are prescribed by the specification of the overall task. It is a matter 
of the experimental designing phase to implement task-specific modules such 
that the time constraints can normally be met in the application phase. The 
situation monitor MM2 must check, after completion of a certain sub-task, 
whether an environmental situation has been arranged or topical information 
has been contributed which is needed in successive sub-tasks. For example, in 
the case that the target object can not be localized on the domestic area, the 
successive sub-tasks are meaningless and therefore the monitor must interrupt 
the system. 

The exception monitor MM3 must observe the whole environmental scene 
continually during the task-solving process. This is necessary for reacting 
appropriately in case of unexpected events. For example, during the appli- 
cation phase the monitor should detect a situation in which a person or an 
object enters the environment inadmissibly. The problem is to distinguish be- 
tween goal-oriented events and unexpected events. A rudimentary strategy 
may work as follows. The environmental scene is subdivided into a field in 
which the robot arm is working for solving a certain sub-task and the 
complementary field. This subdivision changes for each sub-task, e.g. during 
the sub-task of localizing a target object the domestic area is occupied. The 
ceiling-camera CA1 is used for detecting events in fields in which the robot 
arm is not involved, i.e. an unexpected event must have occurred. 

In the experimental designing phase one must determine for each sub- 
task the specific field in which the robot arm is working. More concretely, 
one must determine for each sub-task the specific image area, in which some- 
thing should happen, and the complementary area, in which nothing should 
happen. Simple difference image techniques can be applied for detecting gray 
value changes, but they must be restricted to the areas which are supposed to 
be constant for the current sub-task. However, if a significant change happens 
in the application phase nevertheless, then an unexpected event must have 
occurred and the monitor may interrupt the system. 
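
A rudimentary version of this difference image test can be sketched as follows. 
The sketch assumes that a reference image and the current image of the 
ceiling-camera are given as nested lists of gray values, together with a mask 
marking the area that is supposed to be constant for the current sub-task; the 
function name and both thresholds are illustrative assumptions. 

```python
def unexpected_event(reference, current, static_mask,
                     gray_threshold=30, min_pixels=5):
    """Report a significant gray value change in the area that should be constant.

    reference, current -- 2D lists of gray values
    static_mask        -- 2D list of booleans, True where nothing should happen
    gray_threshold     -- minimal gray value difference counted as a change
    min_pixels         -- minimal number of changed pixels counted as an event
    """
    changed = 0
    for ref_row, cur_row, mask_row in zip(reference, current, static_mask):
        for ref, cur, static in zip(ref_row, cur_row, mask_row):
            if static and abs(cur - ref) > gray_threshold:
                changed += 1
    return changed >= min_pixels
```

Changes inside the working field of the robot arm are ignored by construction, 
which is exactly the limitation of this strategy: unexpected events within the 
working field remain undetected. 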

Task-Specific Module for Monitoring the Task-Solving Process 

In a task-specific module MT9 the generic monitors of the types MM1, MM2, 
and MM3 must work simultaneously. 

MT9 := ( MM1 | MM2 | MM3 )    (4.21) 

The process of module MT8 must be supervised by the process of module 
MT9 continually. For this simultaneous execution we introduce the final 
task-specific module MT10. 

MT10 := ( MT8 | MT9 )    (4.22) 

4.3.8 Overall Task-Specific Configuration of Modules 

For solving the exemplary high-level task we introduced a series of 
task-specific modules, i.e. MT1, ..., MT10. They are defined as sequential and/or 
parallel configurations of generic modules taken from the repository 
(presented in Section 4.2). As a summary, Figure 4.28 shows the overall 
configuration of modules which defines the autonomous camera-equipped robot 
system for solving the exemplary task. 

According to the vertical organization the generic modules MI1, MI2, 
MI3, MB3 are based on vector fields from the deliberate layer, and the 
generic modules MB1, MB2, MB3 make use of visual feedback at the 
reactive layer. Generic module MB3 integrates deliberate and reactive processing 
(which is also done in generic modules MBg and MBq, presented in Appendix 
2). Also the task-specific modules combine deliberate and reactive process- 
ing, which is obvious due to the sequential and/or parallel combination of 
the generic modules. Two effectors contribute to the exemplary task, i.e. the 
robot arm and the robot head. Therefore, two types of state vectors are 
involved which results in a horizontal organization consisting of two columns 
(see Figure 4.6). 

Footnote: Of course, this strategy is not able to detect unexpected events in the 
field in which the robot arm is currently working. More sophisticated approaches 
are necessary for treating problems like these. 

Fig. 4.28. Configuration of task-specific modules based on generic modules for 
solving the exemplary high-level task. 

We configured the system under the guideline of minimalism principles.
A minimal set of behavioral and instructional modules is involved and each
one is responsible for executing a specific sub-task. Every module is essential,
and if one of them fails, then the exemplary task cannot be solved. Three
types of specific modules are used for supervising the task-solving process,
i.e. time, situation, and exception monitors. A high-level task must be solved
completely, i.e. partial solutions which may be caused by module failures
are not acceptable. To design robust task-solving systems which tolerate
module failures, one must include redundant modules. It is the
responsibility of the monitors to recognize erroneous module behaviors and
to bring alternative modules into application. Our work does not treat this
aspect in more detail.

In the next section we introduce basic mechanisms for camera-robot co- 
ordination which have not been treated so far. 
4.4 Basic Mechanisms for Camera-Robot Coordination



For treating the exemplary task we take advantage of certain constant fea- 
tures which characterize the relations between various cameras and the robot 
arm. For approximating the relevant relations we make use of the agility of the 
robot arm and learn from systematic hand movements. That is, in the experi- 
mental designing phase a set of samples, which consist of corresponding robot 
and image coordinates, is memorized and relevant relations are approximated 
from those. For example, this strategy has been applied for determining the 
relations between various coordinate systems in our double-eye-on-hand sys- 
tem, i.e. robot hand and the hand-cameras CA 2 and CA^. We refer to the 
diploma thesis of Kunze for theoretical foundations and experimental results 
[94]. In this section, we concentrate on acquiring the relations involved in
our double-eye-off-hand system, i.e. robot hand and the head-cameras CA 4 
and CA^. The description also includes the optical axes and the fields of 
view. All of these features are represented relative to the static coordinate 
system of the robot arm and can be changed by the degrees of freedom of 
the head system. For the automatic alignment of the head-camera system, 
including the movement of the robot vehicle to the optimal place, it would 
be necessary to design task-specific modules based on the generic modules 
presented in Section 4.2. However, this is beyond the scope of the chapter 
and therefore we assume that the system designer does the work manually. 



4.4.1 Camera Manipulator Relation for One-Step Control 

The relevant modality of the head-camera-manipulator relation depends on 
specific constraints which are inherent in the characteristics of a task. In this 
subsection we consider the sub-task of moving the robot hand to a certain 
position which is located in the field of view of the head-cameras, but starting 
from a position outside the field of view. For example, in our exemplary task 
we treated the sub-task of carrying an object from the domestic area to the 
parking area, but only the latter area was located in the field of view. For 
such applications visual feedback control is not possible, because the robot
hand is not visible in the early phase, i.e. we cannot extract a current
measurement vector $Q(t)$ from the images. Therefore, the only way is to extract
the desired measurement vector $Q^*$, reconstruct the 3D position as accurately
as possible, and move the robot hand in one step to the determined position.
In this section the reconstruction function is approximated nonlinearly by
GBF networks (see Subsection 3.2.2 for foundations of GBF networks). For
various configurations we compute the reconstruction errors in order to
obtain a reasonable network structure which is appropriate for reaching
a certain degree of accuracy.
Relationship between Image and Effector Coordinate Systems

By taking stereo images with the head-cameras and detecting the target
place in the two images, we obtain two two-dimensional positions (i.e. two
2D vectors). The two positions are defined in the coordinate systems of the
two cameras and are combined in a single vector (i.e. a 4D vector). On the other
hand, the robot hand moves within a 3D working space, which is defined in
the basis coordinate system of the robot arm. The virtual gripper point, which
is located midway between the finger tips of the robot hand, is described by
a 3D vector. Hence, we need a function for transforming the target
positions from the image coordinate systems of the cameras to the Cartesian
coordinate system of the robot arm (i.e. transforming 4D vectors into 3D
vectors).

Traditionally, this function is based on principles of stereo triangulation
by taking intrinsic parameters (of the camera) and extrinsic parameters
(describing the camera-robot relationship) into account [52]. In contrast,
we use GBF networks to learn the mapping from stereo image coordinates
into coordinates of a robot manipulator. There are three good reasons for this
approach. First, the intrinsic and extrinsic parameters are unnecessary and
therefore are not computed explicitly. The coordinate mapping from stereo
images to the robot manipulator is determined in a direct way without
intermediate results. Second, usual approaches of camera calibration assume
certain camera models which must be known formally in advance, e.g.
perspective projection and radial distortion. Instead, the learning of
GBF networks takes place without any a priori model and can approximate
any continuous projection function. Third, by varying the number and the
parametrization of the GBFs during the training phase, the accuracy of the
function approximation can be controlled as desired. For example, a coarse
approximation would be acceptable (leading to a minimal description length)
in applications of continual perception-action cycles.

Acquiring Training Samples by Controlled Effector Movements 

The procedure for determining the camera-robot coordination is as follows. 
We make use of training samples for learning a GBF network. First, the set 
of GBFs is configured, and second, the combination factors of the GBFs are 
computed. We configure the set of GBFs by simply selecting certain elements 
from the whole set of training samples and using the input parts (4D vectors) 
of the selected samples to define the centers of the GBFs. The combination
factors for the GBFs are computed with the pseudo-inverse technique, which
yields the least-squares error between pre-specified and computed output values.
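The two learning steps described above can be sketched in a few lines. The following is only a minimal illustration under stated assumptions: the function names (`gbf_design_matrix`, `fit_gbf_network`, `reconstruct_3d`), the use of a single common extent for all GBFs, and the Gaussian form exp(-d^2 / (2 extent^2)) are chosen here for illustration and are not taken from the implementation used in the experiments.

```python
import numpy as np

def gbf_design_matrix(inputs, centers, extent):
    """Activation of every GBF for every input; shape (n_inputs, n_centers)."""
    diff = inputs[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Assumed Gaussian form with one common extent for all basis functions.
    return np.exp(-sq_dist / (2.0 * extent ** 2))

def fit_gbf_network(image_4d, robot_3d, center_idx, extent):
    """Step 1: select GBF centers from the samples.
    Step 2: compute the combination factors with the pseudo inverse."""
    centers = image_4d[center_idx]
    G = gbf_design_matrix(image_4d, centers, extent)
    factors = np.linalg.pinv(G) @ robot_3d      # least-squares solution
    return centers, factors

def reconstruct_3d(image_4d, centers, factors, extent):
    """Map 4D stereo image coordinates to 3D manipulator coordinates."""
    return gbf_design_matrix(image_4d, centers, extent) @ factors
```

Here `image_4d` holds one 4D stereo measurement per row, `robot_3d` the corresponding 3D positions, and `center_idx` the indices of those samples whose input parts become GBF centers.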

The prerequisite for running the learning procedure is the existence of 
training samples. In order to obtain them, we take full advantage of the agility 
of the robot arm. The hand effector moves in the working space systematically 
and stops at equidistant places. For each place we record the 3D position of 
the virtual gripper point of the robot hand, which is equal to the position
supplied to the control unit of the robot arm, i.e. the tool center
point. Furthermore, at each stopping place a correlation-based recognition
algorithm detects the gripper tip in the stereo images (see Figure 4.29) and
the two two-dimensional positions are combined into a 4D vector. All pairs
of 4D-3D vectors are used as training samples for the desired camera-robot
coordination.




Fig. 4.29. The stereo images show the robot hand with parallel jaw fingers. A
correlation-based recognition algorithm has been used to localize the virtual gripper
point. This is illustrated by a white square including two intersecting diagonals.



The strategy of using the robot arm itself for determining the
head-camera-manipulator relation is advantageous in several respects. First, we
do not need an artificial calibration object. Second, samples can be taken both
from the surface and within the working space. Third, the number of samples
for approximating the function is variable due to steerable distances between
the stopping places. Fourth, the head-camera-manipulator relation is
computed relative to the basis coordinate system of the robot arm directly, which
is the relevant coordinate system for controlling the robot hand.

Experiments on the Estimation of the Camera-Robot Relationship

Based on image coordinates of the virtual gripper point, the GBF network
has to estimate its 3D position in the basis coordinate system of the robot
arm. On average the 3D position error should be as low as possible. The
main question of interest is how many GBFs and which extents are needed
to obtain a certain quality for the camera-robot coordination. In order to
answer this question, four experiments have been carried out. In the first and
second experiment, we applied two different numbers of GBFs as examples. The
third experiment shows the effect of doubling the image resolution. Finally,
the fourth experiment takes special care in training the combination weights
of the GBFs. In all four experiments, we systematically increase the GBF extent
and evaluate the mean position error.

We take training samples for each experiment. The working space of the
robot hand is a cube of 300mm side length. The GBFs are spread over
a sub-space of 4D vectors in correspondence to certain stopping places of the
robot hand. That is, the 4D image coordinates (resulting from the virtual
gripper point at a certain stopping place) are used for defining the center of
a Gaussian basis function. The following experiments differ with regard to the
size and the usage of the training samples. The application of the resulting
GBF networks is based on testing samples which consist of input-output pairs
from the same working space as above. For generating the testing samples the
robot hand moves in discrete steps of 20mm, and it is assured that training
and testing samples differ for the most part, i.e. have only a small number
of elements in common.

In the first experiment, the robot hand moved in discrete steps of 50mm
through the working space, which results in 7 x 7 x 7 = 343 training samples.
Every second sample is used for defining a GBF (4 x 4 x 4 = 64 GBFs) and
all training samples for computing the combination weights of the GBFs.
The image resolution is set to 256 x 256 pixels. Figure 4.30 shows in curve
(a) the course of the mean position error (of the virtual gripper point) for a
systematic increase of the Gaussian extent. As the GBFs overlap more and
more, the function approximation improves, and the mean position
error decreases to a value of about 2.2mm.

The second experiment differs from the first in that the robot hand moved 
in steps of 25mm, i.e. 13 x 13 x 13 = 2197 training samples. All samples are 
used for computing the GBF weights, and every second sample for defining a 
GBF (7 x 7 x 7 = 343 GBFs). Figure 4.30 shows in curve (b) that the mean 
position error converges to 1.3mm. 
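The sample and GBF counts of the first two experiments follow directly from the side length of the working space and the step size. A small sketch of this arithmetic, where the helper name `grid_counts` and the reading of "every second sample" as every second stopping place along each axis are assumptions made for illustration:

```python
def grid_counts(side_mm, step_mm, center_stride=2):
    """Number of training samples and of GBF centers on a cubic grid."""
    per_axis = side_mm // step_mm + 1                 # stopping places per axis
    centers_per_axis = len(range(0, per_axis, center_stride))
    return per_axis ** 3, centers_per_axis ** 3

first = grid_counts(300, 50)    # (training samples, GBFs) for 50mm steps
second = grid_counts(300, 25)   # (training samples, GBFs) for 25mm steps
```

With 50mm steps this gives 7 stopping places per axis and 4 centers per axis, with 25mm steps 13 and 7, in agreement with the numbers stated for the two experiments.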

In the third experiment the same configuration has been used as before,
but the image resolution was doubled to 512 x 512 pixels. The accuracy of
localizing the finger tips in the images increases, and hence the mean position
error of the virtual gripper point is reduced once again. Figure 4.30 shows in
curve (c) the convergence to an error value of 1.0mm.

The fourth experiment takes special care of both the training of weights
and the testing of the resulting GBF network. Obviously, there is only a
one-sided overlap between GBFs at the border of the working space. Hence, the
quality of the function approximation can be improved if a specific sub-set
of 3D vectors, which is located at the border of the working space, is not
taken into account. In this experiment, the 343 GBFs are spread over
the original working space as before, but an inner working space of 250mm
side length is used for computing combination factors and for testing the
GBF network. Figure 4.30 shows in curve (d) that the mean position error
decreases to a value of 0.5mm.
Fig. 4.30. The curves show the mean position error versus the extents of GBFs
under four different conditions. (a) Small GBF number, low image resolution. (b)
Large GBF number, low image resolution. (c) Large GBF number, high image
resolution. (d) Experiment (c) and avoiding approximation errors at working space
border. Generally, the error decreases by increasing the Gaussian extent, and the
larger the GBF number or the higher the image resolution the smaller the position
error.



Conclusions from the Experiments 

Based on these experiments, we configure the GBF network such that a
desired accuracy for 3D positions can be reached (e.g. ±0.5mm). During the
application phase, first the target place must be detected in the stereo
images. Second, the two 2D coordinate vectors are put into the GBF network
for computing a 3D position. Finally, the robot hand will move to that 3D
position which is approximately the position of the target place.

Although the obtainable accuracy is impressive, this approach cannot
react to unexpected events during the movement of the robot hand, because
continual visual feedback is not involved. However, there are applications
in which unexpected events are excluded, and for these cases the approach
would be favourable.
4.4.2 Camera Manipulator Relation for Multi-step Control

In applications such as putting down an object on a physical target place
(e.g. on the parking or inspection area) or steering the hand effector over a
large object (e.g. monitor front case), the current position of the effector is
in close spatial neighborhood to the target place. In order to avoid undesired
events such as collisions, one must be careful and move the robot hand just
in small steps under continual visual feedback, i.e. applying a multi-step
controller. Due to the close spatial neighborhood, it is possible to take relevant
images such that both the current and the desired measurement vectors can
be extracted. This is the precondition for applying procedures of image-based
effector servoing, as worked out in Subsection 4.2.1.

A basic constituent of the servoing procedure is a description of the rela- 
tionship between displacements of the robot hand and the resulting displace- 
ments in the image of a head-camera. A usual approach is to approximate the 
projection function, which transforms 3D robot coordinates into 2D image 
coordinates, and specify the Jacobian matrix. The Jacobians have already 
been applied in Subsections 4.3.4, 4.3.5, and 4.3.6. In this subsection we 
specify two variants of projection functions, i.e. a linear and a nonlinear ap- 
proximation, and determine the Jacobians thereof. The training samples of 
corresponding 3D points and 2D points are determined according to the ap- 
proach mentioned in the previous subsection, i.e. tracking controlled gripper 
movements. 



Jacobian for a Linear Approximation of the Projection Function 

A perspective projection matrix $Z$ linearly approximates the relation between
the manipulator coordinate system and the image coordinate system of a
head-camera.

$$Z := \begin{pmatrix} Z_1^T \\ Z_2^T \\ Z_3^T \end{pmatrix} ; \quad \text{with} \quad \begin{array}{l} Z_1^T := (z_{11}, z_{12}, z_{13}, z_{14}) \\ Z_2^T := (z_{21}, z_{22}, z_{23}, z_{24}) \\ Z_3^T := (z_{31}, z_{32}, z_{33}, z_{34}) \end{array} \tag{4.23}$$

The usage of the projection matrix is specified within the following
context. Given a point in homogeneous manipulator coordinates
$P := (X, Y, Z, 1)^T$, the position in homogeneous image coordinates
$p := (x, y, 1)^T$ can be obtained by solving

$$p := f^{li}(P) := \begin{pmatrix} f_1^{li}(P) \\ f_2^{li}(P) \\ f_3^{li}(P) \end{pmatrix} := \frac{1}{\xi} \cdot Z \cdot P ; \quad \text{with} \quad \xi := Z_3^T \cdot P \tag{4.24}$$

According to this, the matrix $Z$ is determined with simple linear methods
by taking the training samples of corresponding 3D points and 2D points into
account [52, pp. 55-58]. The scalar parameters $z_{ij}$ represent a combination
of extrinsic and intrinsic camera parameters which we leave implicit. The
specific definition of the normalizing factor $\xi$ in equation (4.24) guarantees
that function $f_3^{li}(P)$ is constant 1, i.e. the homogeneous image coordinates
of position $p$ are given in normalized form.
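One common linear method for estimating a projection matrix from 3D-2D correspondences is the direct linear transformation, solved with a singular value decomposition. Whether this is exactly the method of [52] is an assumption; the sketch below only illustrates the idea, and the function names `estimate_projection_matrix` and `project` are hypothetical.

```python
import numpy as np

def estimate_projection_matrix(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) from n >= 6 samples."""
    n = len(points_3d)
    P = np.hstack([points_3d, np.ones((n, 1))])      # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    # Each sample contributes two homogeneous equations that are linear
    # in the twelve matrix entries:
    #   Z_1^T P - x * Z_3^T P = 0   and   Z_2^T P - y * Z_3^T P = 0
    A[0::2, 0:4] = P
    A[0::2, 8:12] = -points_2d[:, [0]] * P
    A[1::2, 4:8] = P
    A[1::2, 8:12] = -points_2d[:, [1]] * P
    # The right singular vector of the smallest singular value solves A z = 0.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def project(Z, point_3d):
    """Apply equation (4.24): normalize Z * P by xi = Z_3^T * P."""
    P = np.append(point_3d, 1.0)
    xi = Z[2] @ P
    return (Z[:2] @ P) / xi
```

The estimated matrix is determined only up to a scale factor, which cancels in `project` because of the normalization by the third row.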

Next, we describe how a certain change in manipulator coordinates affects 
a change in image coordinates. The Jacobian for the transformation 
in equation (4.24) is computed as follows. 



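$$J^f(P) := \begin{pmatrix} \frac{\partial f_1^{li}}{\partial X}(P) & \frac{\partial f_1^{li}}{\partial Y}(P) & \frac{\partial f_1^{li}}{\partial Z}(P) \\ \frac{\partial f_2^{li}}{\partial X}(P) & \frac{\partial f_2^{li}}{\partial Y}(P) & \frac{\partial f_2^{li}}{\partial Z}(P) \end{pmatrix} := \begin{pmatrix} \frac{z_{11} \cdot Z_3^T \cdot P \, - \, z_{31} \cdot Z_1^T \cdot P}{(Z_3^T \cdot P) \cdot (Z_3^T \cdot P)} & \cdots \\ \vdots & \ddots \end{pmatrix} \tag{4.25}$$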



These computations must be executed for both head-cameras $CA_2$ and
$CA_3$, which results in two perspective projection matrices $Z_2$ and $Z_3$
and two Jacobians $J_2^f$ and $J_3^f$.
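The entries of the Jacobian follow from applying the quotient rule to equation (4.24). The following sketch evaluates all six entries at once; it is an illustration only, and the function names `project` and `projection_jacobian` are hypothetical.

```python
import numpy as np

def project(Z, point_3d):
    """Image position (x, y) of a 3D point according to equation (4.24)."""
    P = np.append(point_3d, 1.0)
    return (Z[:2] @ P) / (Z[2] @ P)

def projection_jacobian(Z, point_3d):
    """2x3 Jacobian of the projection with respect to (X, Y, Z)."""
    P = np.append(point_3d, 1.0)
    xi = Z[2] @ P                      # normalizing factor Z_3^T * P
    numer = Z[:2] @ P                  # (Z_1^T * P, Z_2^T * P)
    # Quotient rule, entry (j, k):
    #   (z_jk * xi - z_3k * Z_j^T * P) / xi^2
    return (Z[:2, :3] * xi - np.outer(numer, Z[2, :3])) / xi ** 2
```

A simple plausibility check is to compare the result with central finite differences of `project` around the same 3D point.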



Jacobian for a Nonlinear Approximation of the Projection Function 



Instead of using a projection matrix we can also take a GBF network for
approximating (nonlinearly) the relation between the manipulator coordinate
system and the image coordinate system of a head-camera. The functions
$f_1^{li}$ and $f_2^{li}$ in equation (4.24) must each be redefined as a
weighted sum of Gaussians.

$$f_j^{nl}(P) := \sum_{i=1}^{I} w_{ij} \cdot f_i^{Gs}(P) ; \quad j \in \{1, 2\} \tag{4.26}$$

$$\text{with} \quad f_i^{Gs}(P) := \exp\left( - \frac{\| P - P_i \|^2}{2 \sigma_i^2} \right)$$

These equations represent a GBF network with a two-dimensional output.
The centers $P_i$ and extents $\sigma_i$, for $i \in \{1, \dots, I\}$, are obtained by usual
approaches of GBF network learning.

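The Jacobian for the redefined transformation $f^{nl}$ is as follows.

$$J^f(P) := \begin{pmatrix} \frac{\partial f_1^{nl}}{\partial X}(P) & \frac{\partial f_1^{nl}}{\partial Y}(P) & \frac{\partial f_1^{nl}}{\partial Z}(P) \\ \frac{\partial f_2^{nl}}{\partial X}(P) & \frac{\partial f_2^{nl}}{\partial Y}(P) & \frac{\partial f_2^{nl}}{\partial Z}(P) \end{pmatrix} \tag{4.27}$$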



These computations must be executed for both head-cameras $CA_2$ and
$CA_3$, which results in two GBF networks and two Jacobians $J_2^f$ and $J_3^f$. For
determining appropriate translations of the robot hand we must combine the
two Jacobians, compute the pseudo inverse, and apply the resulting matrix
to the difference vector between desired and current measurement vectors in
the images. This procedure has already been applied in Subsection 4.3.5 (see
equations (4.12), (4.13), (4.14), and (4.15)).
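The combination step can be sketched as follows. Equations (4.12) to (4.15) are not repeated in this section, so the concrete form used here (stacking the two 2x3 Jacobians into one 4x3 matrix, a threshold `eta` on the image deviation, and a servoing factor `s`) is an assumption that merely follows the verbal description; the function name `servo_step` is hypothetical.

```python
import numpy as np

def servo_step(J2, J3, q_desired, q_current, s=0.5, eta=1.0):
    """One translation of the robot hand from a 4D image deviation."""
    J = np.vstack([J2, J3])                  # combined 4x3 Jacobian
    deviation = q_desired - q_current        # 4D difference vector
    if np.linalg.norm(deviation) <= eta:
        return np.zeros(3)                   # goal reached, null vector
    # Pseudo inverse maps the image deviation back to a 3D displacement.
    return s * (np.linalg.pinv(J) @ deviation)
```

With `s < 1` only a fraction of the computed displacement is executed per cycle, which is what makes the procedure a multi-step controller.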

Dealing with Inaccurate Head-Camera-Manipulator Relations

In Subsection 4.4.1, we showed by example that GBF networks can
approximate the transformation between coordinates in stereo images and
coordinates in the robot arm up to an impressive degree of accuracy. However,
the accuracy decreases considerably if the relation between head-camera and
robot arm is changed physically by accident. Instead of approximating
the relation again, we can alternatively work with the inaccurate
approximation of the head-camera-manipulator relation and make use of the servoing
mechanism. The following experiments are executed in order to confirm this
strategy.

Servoing Experiments under Inaccurate Head-Camera-Manipulator Relations

The spatial distance between the center of the head-camera system and the
center of the robot arm is about 1500mm, and the focal length of each of the
two head-cameras has been set to 12mm. The working space of the
robot hand is a cube of side length 400mm. Projection matrices are used for
approximating the projection function between robot and image coordinates.
Three different approximations are considered which are based on different
densities of the training samples. Concretely, for the first approximation the
stopping places of the robot hand were at distances of 100mm which yielded
125 training samples, for the second approximation the hand stopped every
200mm which yielded 27 training samples, and for the third approximation
the hand stopped every 400mm (i.e. only at the corners of the working space)
which yielded 8 training samples. For each approximation we compute two
Jacobians, one for each head-camera, and combine them according to
equation (4.13). In all experiments the gripper must start at a corner and is
supposed to be servoed to the center of the working space by applying the control
function in equation (4.15).

For a servoing factor $s := 0.5$ it turns out that at most 10 cycle iterations
are necessary until convergence. After convergence we make measurements of
the deviation from the 3D center point of the working space. First, the
servoing procedure is applied under the use of the three mentioned projection
matrices. The result is that the final deviation from the goal position is at
most 5mm with no direct correlation to the density of the training samples
(i.e., the accuracy of the initial approximation). According to this, it is
sufficient to use just the eight corners of the working space for the approximation
of the head-camera-manipulator relation. Second, the servoing procedure is
applied after changing certain degrees of freedom of the robot head,
thus simulating various accidents. Changing the head position in
a circle of radius 100mm, or changing the pan or tilt DOF within an angle interval
of 10°, yields deviations from the goal position of at most 25mm. The errors occur
mainly due to the restricted image resolution of 256 x 256 pixels. According to
these results, we can conclude that a multi-step control procedure is able to
deal with rough approximations and accidental changes of the
head-camera-manipulator relation.
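The principle behind these results can be reproduced in a purely synthetic setting. The sketch below is not a reconstruction of the reported experiments: the pinhole cameras, their parameters, the 100mm displacement of the head and all function names are hypothetical, and image noise and pixel quantization are ignored. The Jacobians are computed from an outdated camera model, while the measurements come from the displaced (true) cameras.

```python
import numpy as np

def camera_matrix(f, c, t):
    """Hypothetical pinhole camera: focal length f, image center c, offset t."""
    K = np.array([[f, 0.0, c], [0.0, f, c], [0.0, 0.0, 1.0]])
    return K @ np.hstack([np.eye(3), np.reshape(t, (3, 1))])

def project(Z, point):
    P = np.append(point, 1.0)
    return (Z[:2] @ P) / (Z[2] @ P)

def jacobian(Z, point):
    P = np.append(point, 1.0)
    xi = Z[2] @ P
    return (Z[:2, :3] * xi - np.outer(Z[:2] @ P, Z[2, :3])) / xi ** 2

def servo(true_cams, model_cams, start, q_goal, s=0.5, cycles=20):
    """Multi-step control: measure with the true cameras, steer with the model."""
    hand = np.array(start, dtype=float)
    for _ in range(cycles):
        q_now = np.concatenate([project(Z, hand) for Z in true_cams])
        J = np.vstack([jacobian(Z, hand) for Z in model_cams])
        hand += s * (np.linalg.pinv(J) @ (q_goal - q_now))
    return hand

# Model learned before the accident; true cameras shifted by 100mm afterwards.
model_cams = [camera_matrix(500.0, 128.0, [-100.0, 0.0, 1500.0]),
              camera_matrix(500.0, 128.0, [100.0, 0.0, 1500.0])]
true_cams = [camera_matrix(500.0, 128.0, [0.0, 0.0, 1500.0]),
             camera_matrix(500.0, 128.0, [200.0, 0.0, 1500.0])]

goal = np.array([200.0, 200.0, 200.0])            # center of a 400mm cube
q_goal = np.concatenate([project(Z, goal) for Z in true_cams])

multi_step = servo(true_cams, model_cams, [0.0, 0.0, 0.0], q_goal)
one_step = servo(true_cams, model_cams, [0.0, 0.0, 0.0], q_goal, s=1.0, cycles=1)
multi_step_error_mm = np.linalg.norm(multi_step - goal)
one_step_error_mm = np.linalg.norm(one_step - goal)
```

In this idealized setting the multi-step controller approaches the goal despite the outdated model, because every cycle is corrected by a fresh measurement, whereas a single full step based on the same model remains clearly off the goal.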

Handling Inaccurate Relations by Servoing Mechanisms 

The experiments showed that image-based effector servoing plays a
fundamental role in the application phase of the process of solving a high-level,
deliberate task (see Section 4.3). In addition to this, the servoing
mechanism can also support the experimental designing phase which precedes the
application phase. For example, for certain tasks of active vision one must
determine additional features of the camera and/or the relation between robot
and camera, i.e. in addition to the coordinate transformations treated in
Subsections 4.4.1 and 4.4.2. Specifically, these additional features may comprise
the optical axis and the field of view of the head-cameras relative to the basis
coordinate system of the robot arm.

4.4.3 Hand Servoing for Determining the Optical Axis 

The optical axis plays a fundamental role for supporting various techniques
of image processing. For example, the robot arm may carry an object into the
field of view of a head-camera, then move the object along the optical axis
towards the camera, and finally inspect the object in detail. Both sub-tasks,
i.e. moving the object towards the camera and detailed object inspection, can be
controlled or carried out, respectively, by techniques of image processing which
are concentrated in an area around the image center.

Servoing Strategy for Estimating the Optical Camera Axis 

For estimating the optical axis of a head-camera relative to the basis
coordinate system of the robot arm we present a strategy which is based on
image-based hand effector servoing. The virtual gripper point of the robot
hand is servoed to two distinct points located on the optical axis. It is
assumed that all points located on this axis are projected approximately to the
image center. Accordingly, we must servo the robot hand such that the
two-dimensional projection of the virtual gripper point approaches the
image center. In the goal situation the 3D position of the virtual gripper point
(which is the known tool center point in the $(X, Y, Z)$ manipulator
coordinate system) is taken as a point on the optical axis. Two planes are specified
which are parallel to the $(Y, Z)$ plane with constant offsets $X_1$ and $X_2$ on
the $X$-axis. The movement of the virtual gripper point is restricted just to
these planes (see Figure 4.31).

The object resolution in the images can also be increased by changing the focal
length of the head-camera lens (see Subsection 4.3.6). Therefore, a cooperative
work of changing the external DOF of the robot arm and the internal DOF of
the head-camera may reveal optimal viewing conditions.




Fig. 4.31. Determining the optical axis of a head-camera. 



In image-based effector servoing the deviation between a current situation
and a goal situation is specified in image coordinates. In order to transform a
desired change from image coordinates back to manipulator coordinates the
inverse or pseudo inverse of the Jacobian is computed. For this sub-task the
generic definition of the Jacobian $J^f$ (according to equation (4.25) or
equation (4.27)) can be restricted to the second and third columns, because the
coordinates on the $X$-axis are fixed. Therefore, the inverse of the square
Jacobian matrix is computed, i.e. $(J^f)^{-1}(P)$.

Control Function for the Servoing Mechanism 

The current measurement vector $Q(t)$ is defined as the 2D image location
of the virtual gripper point and the desired measurement vector $Q^*$ as the
image center point. The variable state vector $S^v(t)$ consists of the two
variable coordinates of the tool center point in the selected plane $(X_1, Y, Z)$ or
$(X_2, Y, Z)$. With these redefinitions of the Jacobian we can apply the
control function which has already been presented in equation (4.15). The hand
position is changed by a non-null vector $C(t)$ if the desired and the current
position in the image deviate by more than a threshold $\eta_1$. According to our
strategy, first the virtual gripper point is servoed to the intersection point
$P_1$ of the unknown optical axis with the plane $(X_1, Y, Z)$, and second to the
intersection point $P_2$ with plane $(X_2, Y, Z)$. The two resulting positions of
the virtual gripper point specify the axis, which is represented in the basis
coordinate system of the robot arm. Figure 4.32 shows, for the hand servoing on
one plane, the succession of the virtual gripper point extracted in the image;
the final point is located at the image center (servoing factor $s := 0.3$).
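The two ingredients of this strategy, the in-plane correction with the restricted Jacobian and the construction of the axis from the two resulting points, can be sketched as follows. This is only an illustration; the function names and the threshold handling are assumptions, and the control function of equation (4.15) is not repeated here.

```python
import numpy as np

def plane_servo_step(J, q_current, q_center, s=0.3, eta=1.0):
    """One correction (dY, dZ) of the gripper point within a plane X = const."""
    deviation = q_center - q_current
    if np.linalg.norm(deviation) <= eta:
        return np.zeros(2)                     # image center reached
    J_yz = J[:, 1:3]                           # second and third column: 2x2
    # Solving the 2x2 system is equivalent to applying the inverse Jacobian.
    return s * np.linalg.solve(J_yz, deviation)

def optical_axis(P1, P2):
    """Axis through the two servoed points: base point and unit direction."""
    direction = (P2 - P1) / np.linalg.norm(P2 - P1)
    return P1, direction
```

Here `J` is the full 2x3 Jacobian of one head-camera, and `P1`, `P2` are the tool center points reached on the two planes.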






Fig. 4.32. Course of detected virtual gripper point in the image. 



4.4.4 Determining the Field of Sharp View 

Image-based hand effector servoing is also a means for constructing the field 
of sharp view of a head-camera which can be approximated as a truncated 
pyramid with top and bottom rectangles normal to the optical axis (see Fig- 
ure 4.33). The top rectangle is small and near to the camera, the bottom 
rectangle is larger and at a greater distance from the camera. 

Servoing Strategy for Determining the Depth Range of Sharp View 

For determining the depth range of sharp view the virtual gripper point is
servoed along the optical axis and the sharpness of the depicted finger tips
is evaluated. As the finger tips are located within an area around the image
center, we can extract a relevant rectangular patch easily and compute the
sharpness in it (see Subsection 4.3.6 for a possible definition of sharpness
measurements). Figure 4.34 shows these measurements for a head-camera
with focal length 69mm. The robot hand starts at a distance of 1030mm from
the camera and approaches to 610mm with stopping places every 30mm (this
gives 15 measurements). We specify a threshold value $Q^*$ for the
measurements $Q(t)$ for defining the acceptable level of sharpness. In Figure 4.34 four
measurements surpass the threshold, i.e. numbers 8, 9, 10, 11, which means
that the depth range of sharpness is about 90mm, reaching from 700mm to
790mm distance from the camera.
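The relation between the measurement numbers and the distances can be checked with a short calculation. The sketch assumes that the measurements are counted from zero, i.e. number 0 is taken at 1030mm; this assumption is made here because it reproduces the stated range, and the helper name `stop_distance_mm` is hypothetical.

```python
def stop_distance_mm(index, start_mm=1030, step_mm=30):
    """Distance from the camera at a stopping place (zero-based index assumed)."""
    return start_mm - step_mm * index

distances = [stop_distance_mm(k) for k in range(15)]        # all 15 stops
sharp_stops = [stop_distance_mm(k) for k in (8, 9, 10, 11)]  # above threshold
depth_range_mm = max(sharp_stops) - min(sharp_stops)
```

Under this assumption the last stop lies at 610mm and the four sharp measurements span the distances from 790mm down to 700mm.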
Fig. 4.33. Pyramid field of view and sharpness.

Fig. 4.34. Sharpness measurements in the image section containing the finger tips;
course for approaching the fingers along the optical axis of a head-camera.
Horizontal axis: index of distance.



Control Functions for the Servoing Mechanism 

The control procedure consists of two stages, first reaching the sharp field, 
and second moving through it. 
$$C_1(t) := \begin{cases} s & : \; (Q^* - Q(t)) > 0 \\ 0 & : \; \text{else} \end{cases} \tag{4.28}$$

$$C_2(t) := \begin{cases} s & : \; (Q^* - Q(t)) < 0 \\ 0 & : \; \text{else} \end{cases} \tag{4.29}$$



The variable state vector $S^v(t)$ is just a scalar defining the position of
the virtual gripper point on the optical axis and the control vector $C(t)$ is
a constant scalar (e.g. $s := 30$mm). As a result of this procedure, we obtain
the top and bottom point on the optical axis which characterize the depth
range of sharpness. The width and height of the field of sharp view must be
determined at these two points which are incident to the top and bottom
rectangle of the truncated pyramid. Once again, the agility of the manipulator
comes into play to determine the rectangle corners. First, the virtual
gripper point must be servoed on the top plane and, second, on the bottom plane.
Sequentially, the gripper should reach those four 3D positions for which the
virtual gripper point is projected onto one of the image corners. The control
schema is equal to the one for determining the optical axis with redefined
measurement vectors and control vectors. By repeating the procedure for both
planes, we obtain the eight corners of the truncated pyramid. For example,
using square images from our head-camera (focal length 69mm) the
side length of the top rectangle is 80mm and of the bottom rectangle 90mm.
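The two-stage control procedure of equations (4.28) and (4.29) can be sketched as a simple loop over the stopping places. The callable `sharpness_at`, the function name, and the convention that the position is given as distance from the camera (which decreases by `s` in every step) are assumptions for illustration.

```python
def depth_range_of_sharpness(sharpness_at, start_mm, stop_mm, q_star, s=30):
    """Return (nearest, farthest) sharp distance, or None if never sharp.

    sharpness_at: hypothetical callable, distance in mm -> sharpness Q(t).
    """
    d = start_mm
    # Stage 1, equation (4.28): step by s while Q* - Q(t) > 0 (not yet sharp).
    while d >= stop_mm and q_star - sharpness_at(d) > 0:
        d -= s
    if d < stop_mm:
        return None                      # sharp field was never reached
    far = near = d
    # Stage 2, equation (4.29): step by s while Q* - Q(t) < 0 (still sharp).
    while d >= stop_mm and q_star - sharpness_at(d) < 0:
        near = d
        d -= s
    return near, far
```

As a usage example with a purely hypothetical sharpness profile, `depth_range_of_sharpness(lambda d: 100.0 - abs(d - 745.0), 1030, 610, q_star=50.0)` moves through the stops 1030mm, 1000mm, and so on, and reports the first and last stop at which the profile exceeds the threshold.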

We presented an approach for determining the field of view of a camera
based on the optical axis. Several other applications of optical axes are
conceivable. For example, the head position can be determined relative to the
coordinate system of the robot arm by changing the pan and/or tilt DOF of
the robot head into two or more discrete states, determining the optical axis
for each state, and intersecting the axes. Although it is interesting, we do not
treat this further.



The final Section 4.5 summarizes the chapter. 



4.5 Summary and Discussion of the Chapter 

This chapter presented basic mechanisms and generic modules which are rel- 
evant for designing autonomous camera-equipped robot systems. For solving 
a specific high-level task we performed the designing phase and thus we de- 
veloped an exemplary robot vision system. In particular, we did experiments 
in the task-relevant environment to construct image operators and adap- 
tive procedures, and thus implemented task-specific modules and combina- 
tions thereof. Intentionally, a simple high-level task has been chosen in order 
to clearly present the principles of developing autonomous camera-equipped 
robot systems. By relying on a repertoire of mechanisms and modules one 
can develop more sophisticated robot systems for solving more complicated 
high-level tasks. Generally, the designing/developing phase must treat the
following aspects: 

• Determine the way of decomposing the high-level task into sub-tasks, 

• determine whether to solve sub-tasks with instructional or behavioral mod- 
ules, 

• determine how to integrate/combine deliberate plans and visual feedback 
control, 

• determine how to generate deliberate plans from visual input, 

• determine which type of controller is useful for specific sub-tasks, 

• determine the consequences of the specific hardware in use (e.g. type of 
camera objectives, type of gripper fingers) with regard to image analysis 
and effector tracking, 

• determine strategies for arranging cameras such that image processing is 
simplified (e.g. degrees of compatibility, simplicity of appearance
manifolds), 

• determine strategies of moving cameras and taking images such that the 
task-relevant information can be extracted at all, 

• verify the appropriateness of image processing techniques along with spe- 
cific parameters, 

• construct adaptive procedures such that image processing parameters can 
be fine-tuned automatically, if needed, 

• learn operators for object and/or situation recognition, 

• construct goal-directed visual servoing procedures along with specific pa- 
rameters, 

• construct exploration strategies which are based on immediate rewards or 
punishments instead of explicit goals, 

• determine camera-robot relations depending on the intended strategy of 
using them, 

• determine a reasonable time-schedule which must be kept for executing the 
sub-tasks, 

• determine a strategy for reacting appropriately to unexpected events, 

• etc. 

As a result of the designing/developing phase, one expects to obtain a
configuration of appropriate task-specific modules for treating and solving a
high-level task autonomously during the application phase. Despite the
indisputable advantages of the bottom-up designing methodology, the
task-solving process may nevertheless fail, especially if completely new
aspects occur in the application phase. Consequently, the designing/developing
phase and the application phase must be integrated more thoroughly, so that
learning takes place during the whole process of task-solving and is not
restricted to the designing/developing phase. 




5. Summary and Discussion 



We presented a paradigm of developing camera-equipped robot systems for 
high-level Robot Vision tasks. The final chapter summarizes and discusses 
the work. 



5.1 Developing Camera-Equipped Robot Systems 

Learning-Based Design and Development of Robot Systems 

There are no generally accepted methodologies for building embedded sys- 
tems. Specifically, autonomous robot systems can not be constructed on the 
basis of pre-specified world models, because they are hardly available due 
to imponderables of the environmental world. For designing and develop- 
ing autonomous camera-equipped robot systems we propose a learning-based 
methodology. One must demonstrate relevant objects, critical situations, and 
purposive situation-action pairs in an experimental phase prior to the appli- 
cation phase. Various learning machines are responsible for acquiring image 
operators and mechanisms of visual feedback control based on supervised ex- 
periments in the task-relevant, real environment. This supervisory nature of 
the experimental phase is essential for treating high-level tasks in the appli- 
cation phase adequately, i.e. the behaviors of the application phase should 
meet requirements like task-relevance, robustness, and time limitation simul- 
taneously. 



Autonomous camera-equipped robot systems must be designed and 
developed with learning techniques for exploiting task-relevant expe- 
riences in the real environment. 



Learning Feature Compatibilities and Manifolds 

Some well-established scientists of the autonomous robotics community have
expressed doubts about this central role of learning. They argued that
tremendous numbers of training samples and learning cycles are needed and that
the overgeneralization/overfitting dilemma makes it difficult to acquire
really useful knowledge. However, in this work we introduced some principles 
which may overcome those typical learning problems to a certain extent. For 
learning image operators we distinguished clearly between compatibilities and 
manifolds. A compatibility is related to a feature which is more or less stable 
under certain processes, e.g. certain proprioceptive features of the robot ac- 
tuator system or the camera system, certain inherent features of rigid scene 
objects and their relationships. Such features help to recognize certain facets 
of the task-solving process again and again despite other changes, e.g. 
recognizing the target position during a robot navigation process. Comple- 
mentary to this, manifolds are significant variations of image features which 
originate under the task-relevant change of the spatial relation between robot 
effector, cameras, and/or environmental objects. The potential of manifolds 
is to represent systematic changes of features which are the foundation for 
steering and monitoring the progress of a task-solving process. For solving a 
high-level robotic task both types of features are needed and are applied in 
a complementary fashion. 



The purpose of systematic experiments in the real environment is 
to determine realistic variations of certain features, i.e. to learn feature 
compatibilities and manifolds. 



Balance between Compatibilities and Manifolds 

Mathematically, a compatibility is represented by the mean and the variance 
of the feature, or in case of a feature vector by the mean vector and the co- 
variance matrix. The underlying assumption is a multi-dimensional Gaussian 
distribution of the deviations from the mean. In particular, the description 
length of a compatibility is small. In contrast, a manifold is much more
complicated, and therefore the necessary approximation function goes beyond
a simple Gaussian. In order to reduce the effort of application during
the online phase, the complexity of the manifold should be just as high as
necessary for solving the task. Certain compatibilities can be incorporated
with the potential of reducing the complexity of related manifolds, e.g. by
applying the Log-polar transformation to images of a rotating camera one reduces 
the manifold complexity of the resulting image features. The combination 
of features, which will be involved in a task-solving process, is determined 
on the basis of evaluations in the experimentation phase. A compromise is 
needed for conflicting requirements such as task-relevance, robustness, and 
limited period of time. This compromise is obtained by a balance between 
degrees of compatibilities on the one hand and complexities of manifolds on 
the other hand for the two types of features. 



For solving a high-level robotic task, the experimentation phase must 
result in a balance between two types of features, i.e. approximately
constant features (compatibilities) versus systematically changing
features (manifolds). 
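
As a small illustration of the representation described above, the following sketch estimates a compatibility as a mean vector and a covariance matrix and uses the Mahalanobis distance as a degree of compatibility for new measurements. The two-dimensional feature, the sample values, and the function names are assumptions made only for this example.

```python
import numpy as np

def learn_compatibility(samples):
    """Estimate mean vector and covariance matrix of a feature ensemble."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, cov

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a feature vector from the learned mean."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
# Hypothetical feature (e.g. an angle in degrees and a length ratio),
# measured 200 times in the experimentation phase.
samples = rng.normal(loc=[90.0, 0.5], scale=[2.0, 0.05], size=(200, 2))
mean, cov = learn_compatibility(samples)

d_typical = mahalanobis([91.0, 0.52], mean, cov)   # close to the ensemble
d_outlier = mahalanobis([120.0, 0.9], mean, cov)   # far from the ensemble
```

The short description length is visible here: for an m-dimensional feature only m mean values and m(m+1)/2 distinct covariance entries have to be stored.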



Exploiting Degrees of Freedom for Solving Tasks 

A high-level robotic task leaves some degrees of freedom or even accepts 
different strategies for a solution, e.g. different trajectories of robotic object 
grasping, or different camera poses for monitoring a manipulation task. The 
latter example belongs to the paradigm of Active Vision. The degrees of 
freedom must be exploited with the purpose of reducing the complexities of 
feature compatibilities and manifolds. The process of system designing is 
driven to a certain extent by minimal degrees of compatibilities, which are 
needed for the applicability of a priori image operators, and also driven by 
maximal complexities of manifolds, which are accepted for applying learned 
image operators within time limits. In the experimentation phase it is up to 
the system designer to determine relations between the camera system and 
other environmental constituents, such that certain complexities and balances 
of the compatibilities and manifolds are expected to hold. In the subsequent 
application phase a first sub-task is to steer the robot actuators with the 
purpose of automatically arranging relations which are consistent with the 
pre-specified relations as considered in the experimentation phase. 



Camera steering modules are responsible for arranging camera- 
environment relations such that certain image operators will be ap- 
plicable. 
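
The Log-polar example mentioned earlier shows how a suitable arrangement can simplify a manifold: a rotation of the image about its center becomes a cyclic shift in the log-polar image. The sketch below is only a minimal illustration with nearest-neighbour sampling; the grid sizes, the random test image, and the function name are assumptions and do not reproduce the LPT implementation of this work.

```python
import numpy as np

def log_polar(image, n_rho=16, n_theta=64):
    """Resample a square image on a log-polar grid (nearest neighbour)."""
    n = image.shape[0]
    c = (n - 1) / 2.0                                   # image centre
    rho = np.exp(np.linspace(0.0, np.log(c), n_rho))    # radii from 1 to c
    theta = 2.0 * np.pi * np.arange(n_theta) / n_theta  # angles in [0, 2*pi)
    rows = np.rint(c + np.outer(rho, np.sin(theta))).astype(int)
    cols = np.rint(c + np.outer(rho, np.cos(theta))).astype(int)
    return image[rows, cols]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(31, 31))
rotated = np.rot90(image)          # quarter turn about the image centre

lp = log_polar(image)
lp_rotated = log_polar(rotated)
# The quarter turn appears as a cyclic shift by n_theta / 4 = 16 columns.
shifted = np.roll(lp, -16, axis=1)
agreement = np.mean(lp_rotated == shifted)
```

Instead of a two-dimensional change of the whole pattern, a recognition function then only has to cope with a one-dimensional shift along the angular axis.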



Consensus with the Paradigm of Purposive, Animate Vision 

Our approach of developing autonomous camera-equipped robot systems is 
in consensus with the paradigm of Purposive Vision [2]. The compatibilities 
introduced in chapter 2, the learning mechanisms proposed in chapter 3, 
and the generic modules presented in chapter 4 are applicable to a large 
spectrum of high-level tasks. The learned compatibilities and manifolds and 
the generic modules approximate general assumptions underlying the process 
of embedding the Robot Vision system in the environment. Based on that, 
an autonomous system can be developed and adapted for an exemplary high- 
level task. A specific task is solved under general assumptions and by applying 
and adapting general mechanisms. By incorporating learning mechanisms this 
methodology goes far beyond the paradigm of Purposive Vision leading to 
Animate Vision. It has been recognized in the paradigm of Animate Vision 
that learning is necessary to compensate for the world’s unpredictability [12]. 



The development of autonomous, camera-equipped robot systems
follows the paradigm of purposive, animate vision. 



General Systems versus General Development Tools 

In Subsection 1.2.1 we have argued that General Problem Solvers or, in
particular, General Vision Systems will never be available. Despite that, our 
approach of boundary extraction is based on general compatibilities, the ap- 
proach of object recognition is based on manifolds approximating general 
views, and our approach of robot control is based on a small set of generic 
modules. At first glance, the discussion in the introductory chapter seems 
to be in contradiction with the substantial chapters of the work. However, 
a detailed insight reveals that we do not strive for a General Vision System 
but strive for general development tools which mainly consist of statistical 
learning mechanisms. For example, the statistical evaluation of techniques 
for boundary extraction, the neural network learning of recognition functions, 
and the dynamic combination of deliberate and/or reactive vector fields serve 
as general development tools. Based on those, compatibilities, manifolds, and 
perception-action cycles can be learned for solving an exemplary robot vision 
task in the relevant environment. From an abstract point of view, the
purpose of the development tools is to discover features which are more or less
constant and features which change more or less continuously during the
task-solving process. Additionally, covariances and courses of the two categories
of features must be approximated, which results in compatibilities and manifolds,
respectively. 



Striving for general robot vision systems is hopeless and ridiculous, 
but striving for and applying general, learning-based development 
tools leads to really useful systems. 



5.2 Rationale for the Contents of This Work 

Work Does Not Present a Completely Working System 

The primary purpose of this work is not to describe a working robot system 
which might have been developed for solving a specific Robot Vision task 
or a category of similar tasks. ^ Instead, the purpose has been to 
present a general methodology of designing and developing camera-equipped 
robot systems. We focused on three essential issues, i.e. visual attention and 
boundary extraction in chapter 2, appearance-based recognition of objects 
or situations in chapter 3, and perception-action cycles based on dynamic 
fields in chapter 4. The approaches were chosen as examples; the 
main purpose has been to make clear the central role of learning. Therefore, 
the chapters are loosely coupled in order to facilitate the replacement of certain 
approaches by others which might be more appropriate for certain Robot 
Vision tasks. 

^ For persons interested in such systems I kindly recommend visiting my 
home page in the World Wide Web: http://www.ks.informatik.uni-kiel.de/~jpa/research.html . 

Work Hardly Includes Benchmarking Tests 

Generally, the work did not compare the presented approaches with other 
approaches (which is usually done with benchmarking tests). It was not our 
intention to develop new image processing techniques or robot steering strate- 
gies which might surpass a wide spectrum of already existing approaches. 
Instead of that, our intention has been to present a general methodology for 
automatically developing procedures which are acceptable for solving the un- 
derlying tasks. In the work, the only exception concerns Section 3.4 in which 
our favorite approach of object recognition has been compared with nearest 
neighbor classification. However, the reason was to clarify the important role 
of including a generalization bias in the process of learning. 

Work Does Not Apply Specific, Vision-Related Models 

The presented approaches for boundary extraction, object recognition, and 
robot control avoid the use of specific, vision-related models which might be 
derivable from specific, a priori knowledge. Rather, the intention has been 
to learn and to make use of general principles and thus to concentrate on 
minimalism principles. Based on that, it was interesting to determine the 
goodness of available solutions for certain Robot Vision tasks. Of course, 
there are numerous tasks to be performed in more or less customized envi- 
ronments which might be grounded on specific, vision-related models. The 
optical inspection of manufactured electronic devices is a typical example 
for this category of tasks. Nevertheless, in our opinion learning is necessary 
in any application to obtain a robust system. Therefore, our learning-based 
methodology should be revisited with the intention of including and making 
use of specific models, if available. 

Work Is Written More or Less Informally 

Formulas are only introduced when they are necessary for understanding the 
approaches. Intentionally, we avoided overloading the work with too much 
formalism. The universal use of formulas would be meaningful if formal proofs 
for the correctness of techniques could be provided. However, the focus of this 
work is on applying learning mechanisms to Robot Vision. Unfortunately, 
it is an open issue of current research to prove the correctness of learning 
mechanisms or learned approximations. 



5.3 Proposals for Future Research Topics 

Infrastructure for the Experimentation Phase 

The infrastructure for facilitating the experimentation phase must be
improved. Technical equipment together with a comfortable user interface
is needed for systematically changing environmental conditions in order to
generate relevant variations of the input data and provide relevant training
and test scenarios. Techniques are desirable for propagating the covariance
of the input data through a sequence of image processing algorithms and
finally approximating the covariance of the output data [74]. Finally, we need 
a common platform for uniformly interchanging techniques which might be 
available from different research institutes throughout the world. For exam- 
ple, the Common Object Request Broker Architecture [121] could serve as a 
distributed platform to uniformly access a set of corner detectors, evaluate 
them, and finally choose the best one for the underlying task. 
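
To make the idea of covariance propagation concrete, the following sketch uses a first-order (linearized) approximation, in which the output covariance is J Σ Jᵀ with J the Jacobian of the processing step. This is one common choice and is only an assumption here; it is not claimed to be the technique of [74]. The processing step (orientation and length of a line segment from its two end points), the noise level, and all function names are hypothetical.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x), dtype=float)
    J = np.zeros((y0.size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2.0 * eps)
    return J

def propagate_covariance(f, x, cov_x):
    """First-order approximation of the covariance of f(x)."""
    J = numerical_jacobian(f, x)
    return J @ cov_x @ J.T

def line_orientation_and_length(p):
    """Hypothetical processing step: end points -> (orientation, length)."""
    x1, y1, x2, y2 = p
    return np.array([np.arctan2(y2 - y1, x2 - x1),
                     np.hypot(x2 - x1, y2 - y1)])

# Two end points of a horizontal segment of length 100 pixels, each
# coordinate with an assumed standard deviation of 0.5 pixels.
p = np.array([10.0, 20.0, 110.0, 20.0])
cov_p = 0.25 * np.eye(4)
cov_out = propagate_covariance(line_orientation_and_length, p, cov_p)
```

For a sequence of algorithms the same step can be repeated, feeding each output covariance into the next stage, as long as the linearization is acceptable.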

Fusion of Experimentation and Application Phase 

Future work must deal with the problem of integrating the experimentation 
and the application phase thoroughly. Maybe, the designing phase and the 
application phase should be organized as a cycle, in which several iterations 
are executed on demand. In this case, the usability of certain task-solving 
strategies must be assessed by the system itself in order to execute a partial re- 
design automatically. However, scepticism is reasonable concerning the claim 
for an automatic design, because the evolutionary designing phase of the 
human brain took billions of years to reach a deliberate level of solving 
high-level tasks. 

Matter of Experimentations 

In my opinion, it is more promising to extend the subject matter that the experimental 
designing phase deals with. Most importantly, the system designer should 
make experiments with different learning approaches in the task-relevant en- 
vironment in order to find out the most favorable ones. Based on this, the 
outcome of the experimental designing phase is not only a set of learned 
operators for object/situation recognition and a set of learned visual feed- 
back mechanisms but the outcome could also be a learning approach itself, 
which is intended to learn operators for recognition and to adapt feedback 
mechanisms online. 

Representational Integration of Perception and Action 

In camera-equipped systems the robots can be used for two alternative
purposes, leading to a robot-supported vision system (robot-for-vision tasks) or to
a vision-supported robot system (vision-for-robot tasks). All techniques
presented in our work are influenced by this confluence of perception and
action. However, a tighter coupling is conceivable which might lead to an
integrated representational space consisting of perceptive features and motor
steering parameters. Consequently, the task-solving process would be
determined and represented as a trajectory in the combined perception-action 
space. 

Neural Network Learning for Robot Vision 

Our work demonstrates by example the usefulness of Neural Networks for
developing camera-equipped robot systems. Generally, the spectrum of possible 
applications of Neural Networks in the field of Robot Vision is far from being 
recognized. Future work on Robot Vision should increase the role of learning 
mechanisms both during the development and the application phase. 




Appendix 1: Ellipsoidal Interpolation 



Let $\Omega := \{X_1, \cdots, X_I\}$ be a set of vectors, which are taken from the
m-dimensional vector space over the real numbers $\mathbb{R}$. Based on the mean vector
$X^c$ of $\Omega$ we compute the matrix $M := (X_1 - X^c, \cdots, X_I - X^c)$. The
covariance matrix $C$ of $\Omega$ is determined by the equation $C = \frac{1}{I} \cdot M \cdot M^T$.
For matrix $C$ we obtain by principal component analysis the eigenvectors
$E_l$ and the corresponding eigenvalues $\lambda_l$, $l \in \{1, \cdots, I\}$. Let us assume to
have the series of eigenvalues in decreasing order. Based on that, the specific
eigenvalue $\lambda_I$ is equal to 0 and therefore eigenvector $E_I$ can be cancelled (for
an explanation, see Section 3.2). We define a canonical coordinate system
(short, canonical frame) with the vector $X^c$ as the origin and the eigenvectors
$E_1, \cdots, E_{I-1}$ as the coordinate axes. The representation of vectors $X_i$,
$i \in \{1, \cdots, I\}$, relative to the canonical frame is obtained by Karhunen-Loeve
expansion according to equation (3.18). This yields a set of $(I-1)$-dimensional
vectors $\hat{\Omega} := \{\hat{X}_1, \cdots, \hat{X}_I\}$. In the canonical frame we define an
$(I-1)$-dimensional normal hyper-ellipsoid, i.e. principal axes are collinear with the
axes of the canonical frame and the ellipsoid center is located at the origin
of the frame. The half-lengths of this normal hyper-ellipsoid are taken as
$\kappa_l := \sqrt{(I-1) \cdot \lambda_l}$, $l \in \{1, \cdots, I-1\}$.

Theorem 1 All vectors $\hat{X}_i$, $i \in \{1, \cdots, I\}$ are located on the specified
hyper-ellipsoid.

Proof There are several $(I-1)$-dimensional hyper-ellipsoids which interpolate
the set $\hat{\Omega}$ of vectors, respectively. Principal component analysis determines
the principal axes $E_1, \cdots, E_{I-1}$ of a specific hyper-ellipsoid which is subject
to maximization of projected variances along candidate axes. Therefore, all
corresponding vectors in $\hat{\Omega}$, which are represented in the canonical frame,
are located on a normal hyper-ellipsoid with constant Mahalanobis distance
$h$ from the origin. With the given definition for the half-lengths we can show
that $h$ is equal to 1, which proves the theorem.

Let the vectors in $\hat{\Omega}$ be defined as $\hat{X}_i := (\hat{x}_{i,1}, \cdots, \hat{x}_{i,I-1})^T$, $i \in \{1, \cdots, I\}$.

From the vectors in $\hat{\Omega}$ the variance $v_l$ along axis $E_l$, $l \in \{1, \cdots, I-1\}$ is given
by $v_l := \frac{1}{I} \cdot (\hat{x}_{1,l}^2 + \cdots + \hat{x}_{I,l}^2)$. The variances $v_l$ are equal to the eigenvalues $\lambda_l$.

For each vector $\hat{X}_i$ we have the equation
$\frac{\hat{x}_{i,1}^2}{\kappa_1^2} + \cdots + \frac{\hat{x}_{i,I-1}^2}{\kappa_{I-1}^2} = h$, because the
vectors are located on a normal hyper-ellipsoid. Replacing $\kappa_l^2$ in the equation
by the expression $(I-1) \cdot \lambda_l$ yields the following equation

$\frac{1}{I-1} \cdot \left( \frac{\hat{x}_{i,1}^2}{\lambda_1} + \cdots + \frac{\hat{x}_{i,I-1}^2}{\lambda_{I-1}} \right) = h$.

Summing up all these equations for $i \in \{1, \cdots, I\}$ yields the equation
$\frac{I}{I-1} \cdot (I-1) = I \cdot h$, which results in $h = 1$.



q.e.d. 
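
Theorem 1 can also be checked numerically. The sketch below draws a few random vectors, builds the canonical frame by principal component analysis, and evaluates the left-hand side of the ellipsoid equation for every vector. It assumes that the Karhunen-Loeve expansion of equation (3.18) amounts to projecting the centered vectors onto the eigenvectors; the dimensions and variable names are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, I = 6, 4                              # dimension and number of vectors
X = rng.normal(size=(m, I))              # columns are the vectors X_1 .. X_I

Xc = X.mean(axis=1, keepdims=True)       # mean vector
M = X - Xc                               # matrix of centered vectors
C = (M @ M.T) / I                        # covariance matrix

eigval, eigvec = np.linalg.eigh(C)       # eigenvalues in ascending order
order = np.argsort(eigval)[::-1][:I - 1] # keep the I-1 largest eigenvalues
lam = eigval[order]
E = eigvec[:, order]

X_hat = E.T @ M                          # vectors in the canonical frame
kappa = np.sqrt((I - 1) * lam)           # half-lengths of the hyper-ellipsoid

# Left-hand side of the ellipsoid equation, one value per vector.
h = np.sum((X_hat / kappa[:, None]) ** 2, axis=0)
```

Up to rounding errors every entry of `h` should equal 1, in agreement with the theorem.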




Appendix 2: Further Behavioral Modules 



In the following we present the behavioral modules MB4, MB5, and MB6. 
They are more sophisticated compared to MB1, MB2, and MB3, which have 
been introduced in Subsection 4.2.2. 

Generic Module MB4 for Assembled Behavior 

The module is similar to the generic module MB2 in that the state of the 
effector must be changed continually under the constraint of keeping desired 
image measurements. However, instead of combining two goal-oriented cy- 
cles, one replaces the outer cycle by an exploration strategy, i.e. the module 
is intended to solve an exploration sub-task. Only one type of image measure- 
ments is needed, and the robot effector keeps on changing its variable state 
while trying to keep desired measurements (e.g. wandering along a wall).
For this purpose, the effector continually changes its variable state until a
desired measurement in the images is obtained (e.g. proximity to the wall),
then changes the state of the effector for doing an exploration step (e.g.
hypothetical step along the wall), and once again the effector is controlled to
reach desired measurements (e.g. coming back to the wall but at a displaced
position). In order to avoid returning too frequently to the same state
vector, the state vector belonging to a desired measurement is memorized, and
based on this one specifies constraints for the exploration step, e.g.
extrapolating from approximated history. An elementary behavior implemented by
behavioral module MB1 is included, but no plan in form of a deliberate field
is used. Instead, information about the relevant course of effector states is
produced and is represented as a sequence of equilibrium points in a deliberate
field. This is an example of bottom-up flow of information (e.g. course of 
the wall). 
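
The exploration behavior described above can be sketched with a small simulation. Everything in it is a hypothetical stand-in: the wall profile, the distance measurement that replaces real image measurements, the proportional servoing rule, and the step sizes are assumptions made only to show the alternation between reaching a desired measurement, memorizing the state, and taking an extrapolated exploration step.

```python
import numpy as np

def wall_height(x):
    """Hypothetical wall profile, unknown to the controller."""
    return 0.3 * np.sin(0.5 * x)

def measure(state):
    """Simulated image measurement: distance of the effector to the wall."""
    return state[1] - wall_height(state[0])

def explore(start, desired=1.0, tol=1e-3, gain=0.5, step=0.4, n_steps=20):
    state = np.array(start, dtype=float)
    memory = []                          # memorized equilibrium states
    for _ in range(n_steps):
        # Goal-directed cycle: servo until the desired measurement is reached.
        while abs(measure(state) - desired) >= tol:
            state[1] -= gain * (measure(state) - desired)
        memory.append(state.copy())
        # Exploration step: extrapolate from the memorized history.
        if len(memory) >= 2:
            direction = memory[-1] - memory[-2]
            direction = direction / np.linalg.norm(direction)
        else:
            direction = np.array([1.0, 0.0])   # assumed initial direction
        state = state + step * direction
    return np.array(memory)

course = explore(start=(0.0, 3.0))
```

The returned sequence of memorized states plays the role of the equilibrium points from which the course of the wall can be read off.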


Module MB5, continued 



6. Change variable state vector according to equation 

$S^v(t+1) := f^{ts}(C(t), S^v(t))$, and increment time parameter 
$t := t + 1$. 

7. Go to 2. 

8. Stop. 



Generic Module MB6 for Assembled Behavior 

The module is responsible for an assembled behavior which also combines 
deliberate plan execution and visual feedback control. However, in distinction 
to the previous module MB5 now two effectors are involved. The first effector 
is working according to the plan, and the second effector is controlled by 
visual feedback. For example, a vehicle is supposed to move to a certain 
position, which is done by a plan, and simultaneously an agile head-camera 
is supposed to track a by-passing object, which is done by visual feedback 
control. The planned vehicle movement may be useful if the goal position 
is far away and/or does not contain a landmark, and consequently the goal 
position can not be determined from images. For cases like those, the goal 
position must be determined based on task specification. 



Module MB6 



1. Determine relevant type of variable state vectors $S_1^v$. 
Take deliberate field of $S_1^v$ into account. 

Determine relevant type of variable state vectors $S_2^v$ and 
accompanying type of measurements $Q_2$. 
Initialization of a deliberate field for $S_2^v$. 

2. Behavioral module MB1: 

Configure with (type of $S_2^v$, type of $Q_2$), execution, and return ( ). 

3. Construct an equilibrium in the deliberate field of $S_2^v$ based on 
current state vector $S_2^v(t)$. 

4. Determine current state vector $S_1^v(t)$ of the relevant effector. 

5. Determine control vector according to equation 

$C_1(t) := VF_O[\{S_A^v\}, \{S_{Rj}^v\}](S_1^v(t))$. 

6. If ( $\| C_1(t) \| < \eta_1$ ) then go to 9. 

7. Change variable state vector according to equation 

$S_1^v(t+1) := f^{ts}(C_1(t), S_1^v(t))$, and increment time parameter 
$t := t + 1$. 

8. Go to 2. 

9. Memorize final deliberate field of $S_2^v$, and stop. 
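
The loop of the module listed above can be illustrated with a toy simulation of the vehicle and head-camera example. The sketch loosely follows the listed steps and rests on several assumptions that are not taken from this work: the concrete form of the attractor and repellor fields, an additive transition function, a pan angle as the only state of the second effector, and a bearing computation that stands in for real image measurements.

```python
import numpy as np

def attractor(state, goal, gain=0.2, max_norm=0.5):
    """Assumed attractor field: proportional pull towards the goal, capped."""
    force = gain * (goal - state)
    norm = np.linalg.norm(force)
    return force if norm <= max_norm else force * (max_norm / norm)

def repellor(state, obstacle, strength=0.3, radius=1.5):
    """Assumed repellor field: pushes away from an obstacle within a radius."""
    diff = state - obstacle
    dist = np.linalg.norm(diff)
    if dist == 0.0 or dist >= radius:
        return np.zeros_like(state)
    return strength * (1.0 / dist - 1.0 / radius) * diff / dist

def run(vehicle, goal, obstacles, target, eta=1e-3, max_iter=1000):
    vehicle = np.array(vehicle, dtype=float)
    goal = np.asarray(goal, dtype=float)
    pan = 0.0                      # state of the second effector (head camera)
    head_equilibria = []           # deliberate field of the second effector
    for _ in range(max_iter):
        # Step 2: elementary behavior, servo the camera onto the tracked object.
        bearing = np.arctan2(target[1] - vehicle[1], target[0] - vehicle[0])
        for _ in range(50):
            error = bearing - pan  # simulated image measurement
            if abs(error) < 1e-4:
                break
            pan += 0.5 * error
        # Step 3: store the reached camera state as an equilibrium.
        head_equilibria.append(pan)
        # Steps 4 and 5: control vector of the first effector from its plan.
        control = attractor(vehicle, goal)
        for obstacle in obstacles:
            control = control + repellor(vehicle, np.asarray(obstacle, dtype=float))
        # Step 6: stop when the control vector has (almost) vanished.
        if np.linalg.norm(control) < eta:
            break
        # Step 7: assumed additive transition function.
        vehicle = vehicle + control
    return vehicle, np.array(head_equilibria)

final_state, head_states = run(vehicle=(0.0, 0.0), goal=(10.0, 0.0),
                               obstacles=[(5.0, 1.2)], target=(5.0, 10.0))
```

The vehicle is driven purely by the plan, while the sequence `head_states` is produced bottom-up by visual feedback and can be memorized for later use.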



Symbols for Scalars 

i, j, ...             Indices                                         *
I, J, ...             Upper bounds of indices                         *
J J                   Width, height, diagonal of an image             32
Xr,yt                 Image and scene coordinates                     32
Jh                    Width and height of LPT image                   137
$v_1, v_2$            Coordinates of LPT image                        137
r                     Distance of a line from image center            33
$\phi$                Orientation of a line                           33
b                     Effective focal length                          47
                      Threshold parameters for various purposes       *
A.                    Weighting factors for compatibilities           *
s                     Factor for stepwise effector movement           221
?                     Normalizing factor of projection equation       245
                      Regularization parameter                        112
c                     Threshold in PAC methodology                    110
pr                    Probability in PAC methodology                  111
$v_A, v_B$            Parameters of virtual force vector field        185
                      Phase angle in a complex plane                  49
J (^i J bi 5 4^1      Angles for lines and junctions                  *
b                     Lengths of line segments                        55
$\rho$                Radius of polar coordinates                     137
$\theta$              Angle of polar coordinates                      137
$\rho_{min}$          Radius of a foveal image component              137
$\rho_{max}$          Radius of a peripheral image component          137
$\kappa_i$            Half-lengths of axes of hyper-ellipsoid         150
$\sigma_i, \tau$      Parameters of a GBF                             112
$w_i$                 Scalar weights                                  *
$\lambda_i$           Eigenvalues of covariance matrix for PCA        118
Cp                    Radial center frequency of Gabor function       85

Symbols for Sets 

$\mathbb{R}, \mathbb{R}^m$   Set of real numbers, m-dim. vector space        112
V, Vs                 Set or subset of image coordinate tuples        32
P^pl                  Coordinate tuples of peripheral component       137
Q                     Set of parameter tuples of lines                33
^GBF                  Sample data for GBF networks                    112
^PCA                  Sample data for PCA                             118
                      Ensemble of seed vectors                        155
                      Ensembles of pos/neg validation vectors         156
RT-,SCi,              Testing ensembles for recognition functions     162
SH„NS,                Testing ensembles for recognition functions     168


Symbols for Sequences 

L                     Sequence of line points                         39
AB,T,Q                Sequences of angles                             *
n                     Sequence of lengths                             55


Symbols for Images 

                      Gray value image                                32
                      Binary image                                    33
1°                    Image of edge orientations                      35
                      Log-polar transformed image                     137
jSH                   Image of standard Hough transformation          34
jOH                   Image of orientation-selective HT               37
                      Amplitudes of Gabor function application        36
                      Intermediate images of multi-step processing    37


Symbols for Vector Fields 

$VF_A$                Virtual forces attractor vector field           185
$VF_R$                Virtual forces repellor vector field            185
$VF_O$                Virtual forces overall vector field             185


Symbols for Matrices 

M, C                  Matrices for PCA                                118
V                     Diagonal matrix of Gaussian extents             36
F-4>                  Rotation matrix                                 36
c.                    Covariance matrix of a GBF                      113
V                     Matrix of activities from GBF network           113
                      Projection matrix, vector/scalar components     245
Jf                    Jacobian for projection functions               *


Symbols for Vectors 

Z                     Input-output vector                             109
X                     Input vector                                    110
Y                     Output vector                                   110
B                     Parameter vector                                109
                      Mean vector from an ensemble                    *
U                     Center frequencies of Gabor function            36
q, p, p_a, ...        Image coordinate tuples                         39
p^                    Scene coordinate tuples                         *
$Q(t)$                Current measurement vector                      188
$Q^*$                 Desired measurement vector                      188
$S^c$                 Fixed (constant) state vector                   183
$S^v(t)$              Variable (current) state vector                 183
$C(t)$                Control vector                                  184
$S_A^v$               Variable state vector defining an attractor     184
$S_R^v$               Variable state vector defining a repellor       185
                      Difference vectors                              150
                      Seed vector, validation vector                  155
$E_i$                 Eigenvectors of covariance matrix for PCA       118
w                     Weight vector for GBF network                   113


Symbols for Functions 

/, /                  Explicit functions                              112
$f^{im}$              Implicit functions                              109
                      Function of polar line representation           33
$f^{Gs}$              Gauss function                                  112
pmg J i               Functions for modifying Gaussians               150
/f™                   Specific, hyper-ellipsoidal Gaussian            150
fGh Jv,u              Gabor function                                  36
$f^{ph}$              Computing mean local phase of line segment      53
/r                    Recognition functions                           157
$f^{ts}$              Transition function for variable state vector   184
$f^{v1}, f^{v2}$      Computing coordinates of LPT image              138
r ’                   Determine LPT related number of image pixels    139
$f^{ms}$              Measurement function applied to an image        187
$f^{ct}$              Control function generating a control vector    188
f\ r'                 Projection functions                            245
psm                   Functionals                                     112


Symbols for Deviation/Similarity Measures 

Doo Distance between angles modulo 180 39 

Dle Orientation-deviation of a line segment 39 

DpG Junction-deviation of a line pencil 42 

Djp Euclidean distance between positions 42 

Djo Deviation between sequences of angles 42 

DpR Similarity between local phases 53 

Due Rectangle-deviation of a quadrangle 56 

Dp A Parallelogram-deviation of a quadrangle 56 

Dsq Square-deviation of a quadrangle 56 

DpH Rhombus-deviation of a quadrangle 56 

Dtr Trapezoid-deviation of a quadrangle 56 

Dle.qd Orientation-deviation of a quadrangle 55 

DcC-QD Junction-deviation related to a quadrangle 55 

DpR_QD Phase-similarity of a quadrangle 55 

Dsp-QD Generic measure for quadrangle deviation 56 

DpE.PG Orientation-deviation of a polygon 61 

DcC-PG Junction-deviation of a polygon 61 

DpR_PG Phase-similarity related to a polygon 68 

Dsp-PG Generic measure for polygon deviation 68 

Drs, drs Deviation from reflection-symmetry 66 

DtSj dts Deviation from translation-symmetry 66 

Dra, dr a Deviation from a right-angled polygon 67 

Symbols for Other Measures 

Asp_qd Saliency measure of a specific quadrangle 56 

Asp_pg Saliency measure of a specific polygon 68 

VsL Normalized length variance of line segments 56 

Symbols for Methods 

PEi Generic procedure for quadrangle extraction 57 

PE 2 Generic procedure for polygon extraction 68 

PE^ Generic procedure for polyhedra extraction 78 

PE^ Generic procedure for parallelepiped extraction 81 

CEimn Object recognition with 1-nearest neighbor 162 

CEell Object recognition with ellipsoid approximation 162 

CEegn Object recognition with GBF/ellipsoid network 162 

Mli Instructional modules 198 

MBi Behavioral modules 200 

MMi Monitoring modules 203 

MTi Task-specific modules 212 

Other Symbols 

M-junction Junction with M converging lines 42 

iS'^l Type of variable state vector 199 

iQl Type of measurement vector 200 

CAi Camera designations 207 

i Imaginary unit 36 




Index 



Artificial Intelligence, 6 
Aspect graph method, 122 
Autonomous image analysis, 9 

Basis function 

- Gaussian, GBF, 105 

- hyper, HBF, 104 

- hyper-ellipsoidal Gaussian, 105 

- hyper-spherical Gaussian, 105 
- radial, RBF, 104
Behavior 

- assembled, 190 

- behavioral bottom-up design, 173 

- behavioral organization, 13 

- contributory, 177 

- elementary, 189 

Camera 

- ceiling, 207 

- hand, 207 

- head, 208 

Canonical frame, CF, 117 
Compatibility, 15 

- appearance, 109 

- geometric/photometric, 32 

- geometric/photometric polygon, 61 

- geometric/photometric quadra., 55 

- line/edge orientation, 35 

- line/edge orientation, LEOC, 40

- manifold, 111

- parallel/ramp phase, PRPC, 54

- parallelism, 48 
- pattern, 103

- pencil, 77 

- pencil/corner junction, PCJC, 43 

- perception, 8 

- reflection-symmetry, 67 

- right-angle, 67 

- translation-symmetry, 67 

- vanishing-point, 75 



Competence, 12 

- high level, task-solving, 171 
Control 

- function, 188 

- vector, 183 
Controller 

- multi-step, 188 

- one-step, 188 

- proportional-integral, PI, 180 
Corner 

- detector SUSAN, 41 

- gray value, 41 
Corporeality, 12 

Deliberation, 171 
Demonstration 

- Programming by, 179 

- visual, 8 

Design abstractions, 197 
Deviation 

- junction, 42 

- orientation, 39 

- parallelogram, 56 

- rectangle, 56 

- reflection-symmetry, 66 

- rhombus, 56 

- right-angle, 67 

- square, 56 

- translation-symmetry, 66 

- trapezoid, 56 

Dynamic systems theory, 181 

Effector servoing, image-based, 187 
Emergence, 13 
Expert System Shell, 7 
Eye-off-hand system, 10 
Eye-on-hand system, 10 

Features, 19 

- geometric regularity, 27 

- geometric/photometric compat., 27 
Field

- attractor vector, 181 

- dynamic vector, 181 
- repellor vector, 181
Filter, steerable wedge, 43 

Gabor function 

- local edge orientation, 36 

- local phase, 51 
General Problem Solver, 6 
Generic modules 

- behavioral, 200 

- instructional, 198 

- monitoring, 203 
Gestalt principle, 26 
Ground truths, 26 

Hough 

- image, 33 
- peak, 33

- stripe, 48 

Hough transformation, 30 

- orientation-selective, OHT, 37 

- standard, SHT, 34 
- windowed, 31

- windowed orientation-selective, WOHT, 71

Instruction 

- assembled, 189 

- elementary, 189 
Invariant, 16 
ISODATA clustering, 48 

Junction, 42 

- of edge sequences, 42 

- of lines, 42 

Karhunen-Loeve expansion, KLE, 104 
Kinematics, forward, 187 

Learning 

- active, 168 

- biased, 9 

- coarse-to-fine, learning, 133 

- PAC-learnability, 108 

- probably approx. correct, PAC, 19

- Vapnik-Chervonenkis theory, 103

- VC-confidence, 128 
Levenberg-Marquardt algorithm, 157 
Lie group theory, 106 

Log-polar transformation, LPT 

- coordinates, 138 

- gray values, 139 

- image, 137 

- pattern, 139 



Manifold, 15 

- feature, 19 

Measurement function, 187 
Measurement vector 

- current, 187 

- desired, 188 
Memory, shared, 177 
Minimum description length, 19 
Monitor, 196 

- exception, 204 

- situation, 203 

- time, 203 

Occam’s Razor, 25 

Partial recovery, 26 
Pencil 

- of lines, 41 

- point, 41 

Perception-action cycle, 190 
Perceptual organization 

- assembly level, 29 

- primitive level, 29 

- signal level, 29 

- structural level, 29 
Phase, local, 49 
Polygon 

- reflection-symmetric, 64 

- right-angled, 67 

- saliency, 68 

- salient, 27 

- translation-symmetric, 65 
Potential 

- field method, 181 

- function, 181 

Projection matrix, perspective, 245 
Purposive visual information, 8 

Quadrangle 

- rectangle, parallelogram, 54 

- saliency, 56 

- square, rhombus, 54 

- trapezoid, 54 
Qualitativeness, 26 
Quasi-invariance, 26 

Recognition 

- discriminability, 111

- invariance, 111

- probably approximately correct, 111

- robustness, 111
Robot 

- actuator, 10 

- arm, 10 
- drive, 10

- effector, 10 

- gripper, 10 

- head, 10 

- manipulator, 10 

- mobile, 10 

- vehicle, 10 
Robot system, 10 

- autonomous camera-equipped, 14 

- vision-supported, 10 

Scene constituents 

- domestic area, 207 

- inspection area, 207 

- parking area, 207 
Seed 

- appearance, 117

- image, 117 

- pattern, 117 

- view, 155 
Situatedness, 11 
Situation 

- classified, 20 

- scored, 20 

Sparse approximation, 122 
State 

- transition function, 184 

- vector fixed, 183 

- vector variable, 183 
Strategy 

- coarse-to-fine, recognition, 103 

- global-to-local, localization, 27 
Sub-task 

- assembled, 190 

- elementary, 190 
Support vector, SV 

- machine, 105 

- network, 105 
Symbol grounding, 8 

Task-solving process 

- horizontally organized, 195 

- vertically organized, 193 
Task-specific module 

- generic scheme, 205 

- specific sub-tasks, 208 
Temporal continuity, 147 

Vector field 

- current deliberate, 191 

- deliberate, 189 

- generic deliberate, 190 

- non-stationary deliberate, 190 

- reactive, 189 



View 

- validation, 155 
Virtual forces, 184 

- attractor, 184 

- attractor vector field, 184 

- equilibrium, 184, 186 

- repellor, 184 

- repellor vector field, 184 

- superposition, 184 
Vision 

- animated, 8 

- Computer, 5 

- Robot, 8 

- system, general, 7 
Vision system 

- robot-supported, 10 
Visual servoing 

- continual feedback, 187 

- feature-based, 180 

- position-based, 180 